Wikipedia Quality/Hadoop

From Thumper

We have a framework that reads Wikipedia XML files and reduces them to some vital statistics, which we then use for most of our analysis of Wikipedia. For our recent work on measuring contribution quality, we had to run the second stage (analysis) over and over to make sure we got the code right, and each run takes four hours to process the statistics file for the English Wikipedia. The first stage (statistics gathering) is even worse: on a single processor, we estimate that it takes about a month to reduce the English Wikipedia down to a manageable form.

Those numbers have prompted me to think about how to distribute our code more gracefully. Ian has some code that he's cobbled together to do this on some SOE machines, but that is unlikely to help others who want to use our code. I'm thinking that the right solution is to modify our code to be compatible with Hadoop.

--BoAdler, 25 May 2008

Buggy Hadoop

I've been playing with Hadoop Streaming a lot in the last week, and got some small toy instances running. But it turns out to have some issues...

  • it doesn't send an entire input file to a single mapper process
  • it sends the same XML record to multiple mapper processes; HADOOP-3484
  • it can't handle compressed XML input; HADOOP-3562
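To illustrate the second issue: a record reader that works on byte ranges needs a rule that assigns each record to exactly one split, otherwise a record straddling a boundary can be emitted twice. The toy sketch below assumes the usual convention that a record belongs to the split in which its begin marker starts. It is not Hadoop's code; the function name `records_for_split` and the `<page>` markers are hypothetical choices for illustration.

```python
def records_for_split(data, start, end, begin=b"<page>", close=b"</page>"):
    """Yield the records whose begin marker starts inside [start, end).

    A record may run past `end`; it is still emitted by this split only,
    because the next split starts searching at `end` and so cannot see
    a begin marker that started earlier.
    """
    pos = start
    while True:
        b = data.find(begin, pos)
        if b == -1 or b >= end:
            return              # no more records start in this split
        e = data.find(close, b)
        if e == -1:
            return              # truncated record: nothing to emit
        e += len(close)
        yield data[b:e]
        pos = e                 # continue after the record just emitted


def all_records(data, split_size):
    """Simulate one mapper per fixed-size split and collect every record."""
    out = []
    for start in range(0, len(data), split_size):
        out.extend(records_for_split(data, start, start + split_size))
    return out
```

With this rule, the union over all splits contains every record exactly once, regardless of where the split boundaries fall.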

I've actually fixed the last two problems in my copy of the code. I'm struggling with how to contribute them back to Hadoop, which involves writing test cases and documentation and stuff. In case other people need these problems fixed, here are my interim patches:

You should download and apply all of them, to get StreamXmlRecordReader working properly.

--BoAdler 19:53, 18 June 2008 (PDT)

Someone found another test case for HADOOP-3484 that I had not considered, and supplemented the patch. I fixed up the test cases and reposted a patch at the JIRA site. I also submitted the patch for HADOOP-3562 at the website. For anyone interested in these patches, I recommend downloading the latest patch upload directly from the JIRA site.

Relatedly, I've been thinking a lot about HADOOP-3484 and the idea of using the XML record reader to parse the Wikipedia data files. The problem I have with this technique is that all the revisions of a single article constitute a single record, which can be quite large. It doesn't make sense to read this all into memory at once, when the actual code can do its processing by reading in a sub-record at a time. The docs mention a related issue in the FAQ, and suggest creating a jobfile with one DFS filename per line and then having the mapper script fetch the file from DFS. Is there a way to do this efficiently? That is, can I use " mymapper.sh" to have the file streamed to my mapper script via the network? Or does this essentially copy the datafile to the destination machine first, and then stream to the mapper?

--BoAdler 23:40, 25 July 2008 (PDT)
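The jobfile approach from the FAQ could be sketched roughly as follows: a streaming mapper that receives DFS filenames on stdin, pipes each file through `hadoop fs -cat`, and parses the XML incrementally so that the text of only one revision is held in memory at a time. This is a minimal sketch under several assumptions: the `hadoop` command is on the PATH of the task node, the input uses `<page>`, `<title>` and `<revision>` elements, and the per-page revision count stands in for the real statistics. The function names are hypothetical. The sketch does not settle whether the file is staged locally before it reaches the pipe.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def count_revisions(stream):
    """Yield (title, revision_count) per <page>, parsing incrementally."""
    title, count = None, 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            count += 1
            elem.clear()                    # free this revision's text now
        elif tag == "page":
            yield title, count
            title, count = None, 0
            elem.clear()                    # drop the emptied child elements


def open_dfs_file(path):
    """Start `hadoop fs -cat PATH` and return the process (stdout is a pipe)."""
    # Assumption: the hadoop CLI is installed and on PATH on the task node.
    return subprocess.Popen(["hadoop", "fs", "-cat", path],
                            stdout=subprocess.PIPE)


def main():
    for line in sys.stdin:
        # Assumption: the filename is the last tab-separated field of the line.
        path = line.rstrip("\n").split("\t")[-1].strip()
        if not path:
            continue
        proc = open_dfs_file(path)
        for title, n in count_revisions(proc.stdout):
            print(f"{title}\t{n}")
        proc.wait()

# When used as the streaming mapper, end the script with a call to main().
```

Keeping the parsing in `count_revisions` separate from the DFS access means the same function can be exercised on a local file or an in-memory stream before running it on a cluster.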
