Wednesday, November 18, 2009

Getting Started with Hadoop and MapReduce

Over the past week, I have been teaching myself Hadoop -- a framework for the style of parallel programming called MapReduce, made popular by Google (which invented MapReduce), Yahoo! (which created the open-source implementation called Hadoop), and Amazon (whose EC2 cloud computing service provides resources for running the software).

My purpose is simple: one of our clients has a lot of web log data on Amazon S3 (which provides lots of cheap data storage). This data can be analyzed on EC2, using Hadoop. Of course, the people who put the data up there are busy running the web site. They are not analysts/data miners. Nor do they have much bandwidth to help a newbie get started. So, this has been a do-it-yourself effort.

I have some advice for anyone else who might be in a similar position. Over the past few days, I have managed to turn my Windows Vista laptop into a little Hadoop machine, where I can write code in an IDE (that means "integrated development environment", which is what you program in), run it on a one-node Hadoop cluster (that is, my laptop), and actually get results.

There are several ways to get started with Hadoop. Perhaps you work at a company that has Hadoop running on one or more clusters or has a cluster in the cloud. You can just talk to people where you are and get started.

You can also endeavor to install Hadoop yourself. There is a good chance that I will attempt this one day. However, it is supposed to be a rather complicated process. And all that configuration and installation is a long way from actually working with data.

My preferred method is to use a Hadoop virtual machine. And I found a nice one through the Yahoo Hadoop Tutorial (and there are others . . . I would like to find one running the most recent version of Hadoop).

I have some advice, corrections and expectations for anyone else who wants to try this.

(1) Programming languages.

Hopefully, you already know some programming languages, preferably of the object-oriented sort. If you are a Java programmer, kudos to you, since Hadoop is written in Java. However, I found that I can struggle through Java with my rusty knowledge of C++ and C#. (This post is *not* about my complaints about Java.)

For the purposes of this discussion, I do not consider SAS or SPSS to be worthy programming languages. You should be familiar with ideas such as classes, constructors, static and instance variables, functions, and class inheritance. You should also be willing to read through long run-time error messages, whose ultimate meaning is something like "I was expecting a constructor with no arguments" or "I received a Text type when I was expecting an IntWritable."
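To make this concrete, here is roughly what a mapper class looks like under the 0.18 "mapred" API that the Tutorial's virtual machine uses. This is only a sketch of the standard word-count example (the class name is my own, not the Tutorial's), but it shows how the type parameters pin down exactly which key and value types the framework expects:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Word-count mapper: input is (byte offset, line of text);
// output is (word, 1) for every word on the line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            // Collecting anything other than a (Text, IntWritable) pair
            // here produces run-time type errors like the ones above.
            output.collect(word, ONE);
        }
    }
}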


(2) Unix Familiarity

Just as Hadoop is written in Java, it was developed in a Unix environment. This means that you should be familiar with Unix shell commands (you know, "ls" versus "dir", "rm" versus "del", and so on). There is a rumor that a version of Hadoop will run under Windows. I am sure, though, that trying to install open-source Unix-based software under Windows is a major effort; so I chose the virtual machine route.

By the way, if you want to get Unix shell commands on your Windows box, then use Cygwin! It provides the common Unix commands. Just download the most recent version from www.cygwin.com.


(3) Use Yahoo!'s Hadoop Tutorial (at http://developer.yahoo.com/hadoop/tutorial/index.html)

This has proven invaluable in my efforts, although I have a few corrections and clarifications, which I describe below.


(4) Be Prepared to Load Software! (And have lots of disk space)

I have loaded the following software packages on my computer, to make this work:
  • VMware Player 3.0 (from VMware). This is free software that allows a computer to run a virtual machine with another operating system.
  • Hadoop VM Appliance (from Yahoo). This has the information for running a Hadoop virtual machine.
  • Eclipse SDK, version 3.3.2 (from the Eclipse project archives). Note that the Tutorial suggests version 3.3.1; version 3.3.2 also works. Anything more recent is the "Ganymede" release, and it just doesn't work with the Hadoop plug-in.
  • Java Development Kit from Sun.
These are all multi-hundred megabyte zipped files.


(5) Getting VMware to Share Folders between the host and virtual machine

In order to share folders (the easiest way to pass data back and forth), you need to install VMware Tools. VMware publishes instructions, but it took me a while to figure out what to do. The real steps are:

(a) Under VM --> Settings, go to the Hardware tab. Set the CD/DVD to be "connected" and use the ISO image file. My path is "C:\Program Files (x86)\VMware\VMware Player\linux.iso", but I believe that it appeared automatically, after a pause. This makes the virtual machine think it is reading from a CD when it is really reading from this file.

(b) Run the commands to mount the CD image and extract the files
Log in or su to root, and then run the following (the exact name of the tools tarball varies with the VMware version; "ls /cdrom" will show it):
mount /cdrom
cd /tmp
tar zxf /cdrom/VMwareTools-*.tar.gz
umount /cdrom

(c) Do the installation
I accepted the defaults for every question it asked.

cd vmware-tools-distrib
./vmware-install.pl        # installs the tools; it may offer to run the configuration step for you
vmware-config-tools.pl     # run this if the installer did not

(d) Set up a shared folder

Go to VM --> Settings... and go to the Options tab. Enable the shared folders. Choose an appropriate folder on the host machine (I chose "c:\temp") and give it a name on the virtual machine ("temp").

(e) The folder is available on the virtual machine at /mnt/hgfs/temp.
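For example, to pass a file from the host to the virtual machine and back (the file names here are just illustrations):

cp /mnt/hgfs/temp/weblogs.txt ~/     # host's c:\temp --> VM home directory
cp ~/results.txt /mnt/hgfs/temp/     # VM --> host's c:\temp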


(6) Finding the Hadoop Plug-In for Eclipse

The third part of the Tutorial has these curious instructions: "In the hadoop-0.18.0/contrib/eclipse-plugin directory on this CD, you will find a file named hadoop-0.18.0-eclipse-plugin.jar. Copy this into the plugins/ subdirectory of wherever you unzipped Eclipse." These are curious because there is no CD.

You will find this directory on the Virtual Machine. Go to it. Copy the jar file to the shared folder. Then go to the host machine and copy it to the described place.
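On the virtual machine, the copy looks something like this (the location of the hadoop-0.18.0 directory is an assumption -- find yours with something like "find / -name 'hadoop-*-eclipse-plugin.jar'"):

cd /home/hadoop-user/hadoop-0.18.0/contrib/eclipse-plugin    # adjust to wherever Hadoop lives on your VM
cp hadoop-0.18.0-eclipse-plugin.jar /mnt/hgfs/temp/

Then, on the host, copy the jar from c:\temp into the plugins\ subdirectory of the Eclipse installation.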


(7) The code examples in the Tutorial may not work

I believe the following driver code in the Tutorial is not correct (there are no classes named FileInputPath and FileOutputPath):
   FileInputPath.addInputPath(conf, new Path("input"));
   FileOutputPath.addOutputPath(conf, new Path("output"));

The following works -- the class names are actually FileInputFormat and FileOutputFormat, and the output path is set, not added:
   FileInputFormat.addInputPath(conf, new Path("input"));
   FileOutputFormat.setOutputPath(conf, new Path("output"));
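For context, here is what a complete driver looks like against the 0.18 "mapred" API. This is a sketch only: WordCountMapper is the mapper sketched in section (1), and WordCountReducer is a hypothetical reducer with the matching (Text, IntWritable) signature.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // These must match the mapper and reducer output signatures,
        // or you get the run-time type errors mentioned in section (1).
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);  // hypothetical reducer class

        // "input" and "output" are HDFS directories, relative to the
        // user's HDFS home directory.
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        JobClient.runJob(conf);
    }
}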

(8) Thinking the Hadoop Way

Hadoop requires a different style of programming, because it is distributed. Actually, there are three different "machines" involved.

The first is your host machine, which is where you develop code. Although you can develop software on the Hadoop machine, the tools are better on your desktop.

The second is the Hadoop machine, which is where you can issue the commands related to Hadoop. In particular, the command "hadoop" provides access to the parallel data. This machine has a shared drive with the host machine, which you can use to move files back and forth.

The data used by the programs, though, is in yet another place: the Hadoop Distributed File System (HDFS). To move data between HDFS and the virtual machine's local file system, use the "hadoop fs" command (its -put and -get operations, for example); from there, the shared folder moves files between the virtual machine and the host.
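For example (the file and directory names here are just illustrations):

hadoop fs -put /mnt/hgfs/temp/weblogs.txt input     # shared folder --> HDFS
hadoop fs -ls input                                 # see what is in HDFS
hadoop fs -get output/part-00000 /mnt/hgfs/temp/    # HDFS --> shared folder

Here "part-00000" is the conventional name of the first reducer's output file.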

--gordon

10 comments:

  1. Thanks. Even though I probably won't have a project that leads me in this direction, this was very interesting. Very well written too. Keep up the good posts - Jake Posey

  2. Thank you very much for sharing this. It was useful enough but I think if you explained at the end what you mean by "Using the shared folder you can move it to the local machine or anywhere." it would be much appreciated.

    Thank you very much

  3. Hi
    Can you please help?
    I used the code below as told by you but even it does not work.
    conf.setInputPath(new Path("input")); conf.setOutputPath.addOutputPath(new Path("output"));
    Error is : conf.setOutputPath can not be resolved.

    and in the case of the code given in the tutorial
    Error is :FileInputPath can not be resolved

  4. It actually should be:

    FileInputFormat.addInputPath(conf, new Path("output"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

  5. @Zhanibek
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

  6. that also doesn't work

  7. Thanks so much for posting these instructions. They have been enormously helpful. I have got to point 7 in your list above but can't ping the linux guest from the host. Hence when it comes to setting up eclipse there seems to be no way to get eclipse on the host to communicate with the VMWare linux guest.

  8. after Sreenidhi's step, now this is the error

    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 2 more

  9. FYI there is a CD; it is in the table of contents:
    http://developer.yahoo.com/hadoop/tutorial/index.html

    You can download all the tutorials in there, and also the eclipse plugin.

  10. Might want to mention that the linux.iso is part of the tools that you have to download in addition to the VMPlayer, so when it asks if you want to download the tools, say yes.
