Tuesday, December 15, 2009

Hadoop 0.20: Creating Types

In various earlier posts, I wrote code to read and write zip code data (which happens to be part of the companion page to my book Data Analysis Using SQL and Excel). This provides sample data for use in my learning Hadoop and mapreduce.

Originally, I wrote the code using Hadoop 0.18, because I was using the Yahoo virtual machine. I have since switched to the Cloudera virtual machine, which runs the most recent version of Hadoop, V0.20.

I thought switching my code would be easy. The issue is less the difficulty of the switch, then some nuances in Hadoop and java. This post explains some of the differences between the two versions, when adding a new type into the system. I explained my experience with the map, reduce, and job interface in another post.

The structure of the code is simple. I have a java file that implements a class called ZipCode, which contains the ZipCode interface with the Writable interface (which is I include using import org.apache.hadoop.io.*). Another class called ZipCodeInputFormat implements the read/writable version so ZipCode can be used as input and output in MapReduce functions. The input format class uses another, private class called ZipCodeRecordReader, which does all the work. Because of the rules of java, these need to be in two different files, which have the same name as the class. The files are available in ZipCensus.java and ZipCensusInputFormat.java.

These files now use the Apache mapreduce interface rather than the mapred interface, so I must import the right packages into the java code:

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.InputSplit;

And then I had a problem when defining the ZipCodeInputFormat class using the code:

public class ZipCensusInputFormat extends FileInputFormat {
....public RecordReader createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
........return new ZipCensusRecordReader();
....} // RecordReader
} // class ZipCensusInputFormat

The specific error given by Eclipse/Ganymede is: "The type org.apache.commons.logging.Log cannot be resolved. It is indirectly referenced from required .class files." This is a bug in Eclipse/Ganymede, because the code compiles and runs using javac/jar. At one point, I fixed this by including various Apache commons jars. However, since I didn't need them when compiling manually, I removed them from the Eclipse project.

The interface for the RecordReader class itself has changed. The definition for the class now looks like:

class ZipCensusRecordReader extends RecordReader

Previously, this used the syntax "implements" rather than "extends". For those familiar with java, this is the difference between an interface and an abstract class, a nuance I don't yet fully appreciate.

The new interface (no pun intended) includes two new functions, initialize() and cleanup() . I like this change, because it follows the same convention used for map and reduce classes.

As a result, I changed the constructor to take no arguments. This has moved to initialize(), which takes two arguments of type InputSplit and TaskAttemptContext. The purpose of this code is simply to skip the first line of the data file, which contains column names.

The most important for the class is now called nextKeyValue() rather than next(). The new function takes no arguments, putting the results in local private variables accessed using getCurrentKey() and getCurrentValue(). The function next() took two arguments, one for the key and one for the value, although the results could be accessed using the same two functions.

Overall the changes are simple modifications to the interface, but they can be tricky for the new user. I did not find a simple explanation for the changes anywhere on the web; perhaps this posting will help someone else.

No comments:

Post a Comment

Your comment will appear when it has been reviewed by the moderators.