- I'm getting tired of using deprecated features -- it is time to learn the new interface.
- I would like to use some new features, specifically MultipleInputFormats.
- The Yahoo! Virtual Machine (which I recommended in my first post) is not maintained, whereas the Cloudera training machine is.
- And, for free software, I have so far found the Cloudera community support quite effective.
That said, there are a few downsides to the upgrade:
- The virtual machine has the most recent version of Eclipse (called Ganymede), which does not fully work with Hadoop.
- Hence, the virtual machine requires using the command line to compile the Java code.
- I haven't managed to get the virtual machine to share disks with the host (instead, I send source files via Gmail).
Changes from Hadoop 0.18 to 0.20
The updated code with the Hadoop 0.20 API is in RowNumberTwoPass-0.20.java.
Perhaps the most noticeable change is the packages. Before 0.20, Hadoop used classes in a package called "mapred". Starting with 0.20, it uses classes in "mapreduce". These have a different interface, although it is pretty easy to switch from one to the other.
The reason for this change has to do with future development of Hadoop: it will make it possible to release HDFS (the distributed file system) and MapReduce separately. The new interface lives in the org.apache.hadoop.mapreduce package and its subpackages, whereas the old one lives in org.apache.hadoop.mapred.
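In practice, the switch shows up first in your import statements. A minimal illustration (the exact classes you need depend on your program):

```java
// Old API (through Hadoop 0.18): the "mapred" package
// import org.apache.hadoop.mapred.MapReduceBase;
// import org.apache.hadoop.mapred.Mapper;
// import org.apache.hadoop.mapred.OutputCollector;
// import org.apache.hadoop.mapred.Reporter;

// New API (Hadoop 0.20): the "mapreduce" package
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
```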
In the code itself, there are both subtle and major code differences. I have noticed the following changes in the Map and Reduce classes:
- The classes no longer need the "implements" syntax; Mapper and Reducer are now base classes that you extend.
- The function called before the map/reduce is now called setup() rather than configure().
- The function called after the map/reduce is called cleanup().
- The functions all take an argument whose class is Context; this is used instead of Reporter and OutputCollector.
- The map and reduce functions can also throw InterruptedException.
There are a few other minor coding differences. However, the code follows the same logic as in 0.18, and the code ran the first time after I made the changes.
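To make the differences concrete, here is a skeleton of a map class in the new style. This is only a sketch (the class name and the key/value types are placeholders, and it assumes the Hadoop 0.20 jar is on the classpath), but it shows setup(), cleanup(), and Context in use:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Extends Mapper directly; no "implements" and no MapReduceBase.
public class ExampleMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Called once before any map() calls; replaces configure().
    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
    }

    // Context replaces both Reporter and OutputCollector, and the
    // method may now throw InterruptedException.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new LongWritable(1));
    }

    // Called once after the last map() call.
    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
    }
}
```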
The Cloudera Virtual Machine
First, I should point out that I have no connection to Cloudera, which is a company that makes money (or intends to make money) by providing support and training for Hadoop.
The Cloudera Virtual Machine is available here. It requires VMware to run, which is available here. Between the two, the downloads are about 1.5 Gbytes, so have a good internet connection when you want to download them.
The machine looks different from the Yahoo! VM, because it runs X rather than just a terminal interface. The desktop is pre-configured with a terminal, Eclipse, Firefox, and perhaps some other stuff. When I start the VM, I open the terminal and run emacs in the background. Emacs is a text editor that I know well from my days as a software programmer (more years ago than I care to admit). To use the VM, I would suggest that you have some facility with either emacs or VI.
The version of Hadoop is 0.20.1. Note that as new versions are released, Cloudera will probably introduce new virtual machines. Any work you do on this machine will be lost when you replace the VM with a newer version. As I said, I am sending source files back and forth via Gmail. Perhaps you can get the VM to share disks with the host machine. The libraries for Hadoop are in /usr/lib/hadoop-0.20.
Unfortunately, the version of Eclipse installed in the VM does not fully support Hadoop (if you want to see the bug reports, google something like "Hadoop Ganymede"). Fortunately, you can use Eclipse/Ganymede to write code, and it does full syntax checking. However, you'll have to compile and run the code outside the Eclipse environment. I believe this is a bug in this version of Eclipse, which will hopefully be fixed sometime in the near future.
I suppose that I could download a working version of Eclipse (Europa, which is, I think, version 3.3). But that was too much of a bother. Instead, I learned to use the command line interface for compiling and running code.
Compiling and Running Programs
To compile and run programs, you will need the command line. First, create a new Java project in Eclipse. The one thing it needs is a pointer to the Hadoop 0.20 library (actually a "jar"). To add this pointer, do the following after creating the project:
- Right click on the project name and choose "Properties".
- Click on "Java Build Path" and go to the "Libraries" tab.
- Click on "Add External JARs".
- Navigate to /usr/lib/hadoop-0.20 and choose hadoop-0.20.1+133-core.jar.
- Click "OK" on the windows until you are out.
Next, create a package and then add your source code to the package.
After you have created the project, you can compile and run the code from a command line by doing the following.
- Go to the project directory (~/workshop/...).
- Issue the following command: "javac -cp /usr/lib/hadoop-0.20/hadoop-0.20.1+133-core.jar -d bin/ src/*/*.java" [note: there is a space after "bin/"].
- Create the jar: "cd bin; jar cvf ../<project name>.jar */*; cd ..". So, for the RowNumberTwoPass project, I use: "cd bin; jar cvf ../RowNumberTwoPass.jar */*; cd ..".
- Run the code using the command: "hadoop jar RowNumberTwoPass.jar RowNumberTwoPass/rownumbertwopass". The first argument after "hadoop jar" is the jar file with the code. The second is the class and package where the main() function is located.
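Putting the steps above together, the whole compile-jar-run cycle for the RowNumberTwoPass example looks roughly like this (a sketch only: it assumes the project layout described above, with sources under src/ and classes under bin/, and the project directory name is a placeholder):

```shell
cd ~/workshop/RowNumberTwoPass   # hypothetical project directory

# Compile against the Hadoop 0.20 jar, putting classes under bin/
javac -cp /usr/lib/hadoop-0.20/hadoop-0.20.1+133-core.jar -d bin/ src/*/*.java

# Package the compiled classes into a jar
cd bin; jar cvf ../RowNumberTwoPass.jar */*; cd ..

# Run it: the first argument is the jar, the second is the
# package and class containing main()
hadoop jar RowNumberTwoPass.jar RowNumberTwoPass/rownumbertwopass
```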