Sunday, November 29, 2009

Hadoop and MapReduce: Switching to 0.20 and Cloudera

Recently, I decided to switch from Hadoop 0.18 to 0.20 for several reasons:
  1. I'm getting tired of using deprecated features -- it is time to learn the new interface.
  2. I would like to use some new features, specifically MultipleInputFormats.
  3. The Yahoo! Virtual Machine (which I recommended in my first post) is not maintained, whereas the Cloudera training machine is.
  4. And, for free software, I have so far found the Cloudera community support quite effective.
I chose the Cloudera Virtual Machine for a simple reason: it was recommended by Jeff, who works there and describes himself as "a big fan of [my data mining] books". I do not know whether other VMs are available, but I am quite happy with my Cloudera experience so far. Their community support provided answers to key questions, even over the Thanksgiving long weekend.

That said, there are a few downsides to the upgrade:
  • The virtual machine has the most recent version of Eclipse (called Ganymede), which does not work with Hadoop.
  • Hence, the virtual machine requires using the command line to compile the Java code.
  • I haven't managed to get the virtual machine to share disks with the host (instead, I send source files through gmail).
The rest of this post explains how I moved the code that assigns consecutive row numbers (from my previous post) to Hadoop 0.20. It starts with details about the new interface and then talks about updating to the Cloudera virtual machine.


Changes from Hadoop 0.18 to 0.20

The updated code with the Hadoop 0.20 API is in RowNumberTwoPass-0.20.java.

Perhaps the most noticeable change is the packages. Before 0.20, Hadoop used classes in a package called "mapred". Starting with 0.20, it uses classes in "mapreduce". These have a different interface, although it is pretty easy to switch from one to the other.

The reason for this change has to do with future development for Hadoop. This change will make it possible to separate releases of HDFS (the distributed file system) and releases of MapReduce. The following are packages that contain the new interface:

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.map.*;
import org.apache.hadoop.mapreduce.lib.reduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

In the code itself, there are both subtle and major code differences. I have noticed the following changes in the Map and Reduce classes:
  • The classes no longer need the "implements" syntax; the Map and Reduce classes now extend the Mapper and Reducer base classes directly.
  • The function called before the map/reduce is now called setup() rather than configure().
  • The function called after the map/reduce is called cleanup().
  • The functions all take an argument whose class is Context; this is used instead of Reporter and OutputCollector.
  • The map and reduce functions can also throw InterruptedException.
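To make these differences concrete, here is a minimal sketch of a mapper written against the new interface. It is not the RowNumberTwoPass code; the class name and the configuration variable are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A minimal 0.20-style mapper: it extends Mapper directly (no "implements"),
// uses setup()/cleanup() instead of configure(), and writes through Context.
public class ExampleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String prefix;

    @Override
    protected void setup(Context context) {
        // Values come out of the job's Configuration rather than a JobConf.
        prefix = context.getConfiguration().get("example.prefix", "");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Context replaces both OutputCollector and Reporter.
        context.write(new Text(prefix + value.toString()), new LongWritable(1));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after the last call to map(); nothing to do here.
    }
}
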
The driver function has more changes, because JobConf is no longer part of the interface. Instead, the work is set up using a Job object. Variables and values are passed into the Map and Reduce classes through a Configuration object rather than JobConf. Also, the code for the Map and Reduce classes is added in using the call Job.setJarByClass().
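Here is a sketch of what such a driver might look like in 0.20 (again, the class names are illustrative rather than the actual RowNumberTwoPass code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        // Parameters are passed to the mappers and reducers through Configuration.
        Configuration conf = new Configuration();
        conf.set("example.prefix", "row:");

        // Job replaces JobConf as the object that describes the work.
        Job job = new Job(conf, "example job");
        // setJarByClass() tells Hadoop which jar contains the Map and Reduce code.
        job.setJarByClass(ExampleDriver.class);
        job.setMapperClass(ExampleMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}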

There are a few other minor coding differences. However, the code follows the same logic as in 0.18, and the code ran the first time after I made the changes.


The Cloudera Virtual Machine

First, I should point out that I have no connection to Cloudera, which is a company that makes money (or intends to make money) by providing support and training for Hadoop.

The Cloudera Virtual Machine is available here. It requires running a VMWare virtual machine, which is available here. Together, the two downloads are about 1.5 Gbytes, so have a good internet connection when you download them.

The machine looks different from the Yahoo! VM, because it runs X rather than just a terminal interface. The desktop is pre-configured with a terminal, Eclipse, Firefox, and perhaps some other stuff. When I start the VM, I open the terminal and run emacs in the background. Emacs is a text editor that I know well from my days as a software programmer (more years ago than I care to admit). To use the VM, I would suggest that you have some facility with either emacs or VI.

The version of Hadoop is 0.20.1. Note that as new versions are released, Cloudera will probably introduce new virtual machines. Any work you do on this machine will be lost when you replace the VM with a newer version. As I said, I am sending source files back and forth via gmail. Perhaps you can get the VM to share disks with the host machine. The libraries for Hadoop are in /usr/lib/hadoop-0.20.

Unfortunately, the version of Eclipse installed in the VM does not fully support Hadoop (if you want to see the bug reports, google something like "Hadoop Ganymede"). Fortunately, you can use Eclipse/Ganymede to write code, and it does full syntax checking. However, you'll have to compile and run the code outside the Eclipse environment. I believe this is a bug in this version of Eclipse, which will hopefully be fixed sometime in the near future.

I suppose that I could download a working version of Eclipse (Europa, which I think is version 3.3). But that was too much of a bother. Instead, I learned to use the command line interface for compiling and running code.


Compiling and Running Programs

To compile and run programs, you will need to use the command line.

To start a new project, create a new Java project in Eclipse. The one thing it needs is a pointer to the Hadoop 0.20 library (actually a "jar"). To add this jar to the build path, do the following after creating the project:
  • Right click on the project name and choose "Properties".
  • Click on "Java Build Path" and go to the "Libraries" tab.
  • Click on "Add External JARs".
  • Navigate to /usr/lib/hadoop-0.20 and choose hadoop-0.20.1+133-core.jar.
  • Click "OK" on the windows until you are out.
You'll see the new library in the project listing.

Next, create a package and then add your source code to the package.

After you have created the project, you can compile and run the code from a command line by doing the following.
  • Go to the project directory (~/workspace/).
  • Issue the following command: "javac -cp /usr/lib/hadoop-0.20/hadoop-0.20.1+133-core.jar -d bin/ src/*/*.java" [note: there is a space after "bin/"].
  • Create the jar: "cd bin; jar cvf ../ProjectName.jar */*; cd ..". For the RowNumberTwoPass project, I use: "cd bin; jar cvf ../RowNumberTwoPass.jar */*; cd ..".
  • Run the code using the command: "hadoop jar RowNumberTwoPass.jar RowNumberTwoPass/rownumbertwopass". The first argument after "hadoop jar" is the jar file with the code. The second is the class and package where the main() function is located.
Although this seems a bit complicated, it is only cumbersome the first time you run it. After that, you have the commands, and running them again is simple.

1 comment:

  1. At the compiling step it is giving an error

    training@training-vm:~/workspace/second$ javac -cp /usr/lib/hadoop-0.20/hadoop-0.20.1+152-core.jar -d bin/ src/p1/WordCount.java
    src/p1/WordCount.java:53: cannot access org.apache.commons.cli.Options
    class file for org.apache.commons.cli.Options not found
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    can you help me with this !!!

