My purpose is simple: one of our clients has a lot of web log data on Amazon S3 (which provides lots of cheap data storage). This data can be analyzed on EC2, using Hadoop. Of course, the people who put the data up there are busy running the web site. They are not analysts/data miners. Nor do they have much bandwidth to help a newbie get started. So, this has been a do-it-yourself effort.
I have some advice for anyone else who might be in a similar position. Over the past few days, I have managed to turn my Windows Vista laptop into a little Hadoop machine, where I can write code in an IDE (an integrated development environment), run it on a one-node Hadoop cluster (that is, my laptop), and actually get results.
There are several ways to get started with Hadoop. Perhaps you work at a company that has Hadoop running on one or more clusters or has a cluster in the cloud. You can just talk to people where you are and get started.
You can also endeavor to install Hadoop yourself. There is a good chance that I will attempt this one day. However, it is supposed to be a rather complicated process. And, all the configuration and installation is a long way from data.
My preferred method is to use a Hadoop virtual machine. And I found a nice one through the Yahoo Hadoop Tutorial (and there are others . . . I would like to find one running the most recent version of Hadoop).
I have some advice, corrections and expectations for anyone else who wants to try this.
(1) Programming languages.
Hopefully you already know some programming languages, preferably of the object-oriented sort. If you are a Java programmer, kudos to you, since Hadoop is written in Java. However, I found that I can struggle through Java with my rusty knowledge of C++ and C#. (This post is *not* about my complaints about Java.)
For the purposes of this discussion, I do not consider SAS or SPSS to be worthy programming languages. You should be familiar with ideas such as classes, constructors, static and instance variables, functions, and class inheritance. You should also be willing to read through long run-time error messages, whose ultimate meaning is something like "I was expecting a constructor with no arguments" or "I received a Text type when I was expecting an IntWritable."
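If those concepts are fuzzy, here is a minimal sketch (nothing to do with Hadoop itself; all the class and method names are made up for illustration) of the ideas listed above: constructors, static versus instance variables, and inheritance.

```java
// Illustration only: classes, constructors, static vs. instance
// variables, and inheritance. All names here are hypothetical.
class Record {
    // static variable: one copy, shared by every Record created
    static int recordCount = 0;

    // instance variable: each Record carries its own value
    private final String value;

    // no-argument constructor: Hadoop frameworks often require one,
    // which is why "expecting a constructor with no arguments" errors
    // are so common
    Record() {
        this("");
    }

    Record(String value) {
        this.value = value;
        recordCount++;  // every construction bumps the shared counter
    }

    String getValue() {
        return value;
    }
}

// inheritance: LabeledRecord extends Record and reuses its behavior
class LabeledRecord extends Record {
    private final String label;

    LabeledRecord(String label, String value) {
        super(value);  // invoke the parent's constructor
        this.label = label;
    }

    String describe() {
        return label + "=" + getValue();
    }
}

class Example {
    public static void main(String[] args) {
        Record r = new Record("abc");
        LabeledRecord lr = new LabeledRecord("ip", "127.0.0.1");
        System.out.println(r.getValue());
        System.out.println(lr.describe());
        System.out.println(Record.recordCount);
    }
}
```

Being comfortable reading code like this is roughly the level of Java you need before the Hadoop error messages start making sense.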
(2) Unix Familiarity
Just as Hadoop is written in Java, it was developed in a Unix environment. This means that you should be familiar with Unix shell commands (you know, "ls" versus "dir", "rm" versus "del", and so on). There is a rumor that a version of Hadoop will run under Windows. I am sure, though, that trying to install open-source Unix-based software under Windows is going to be a major effort; so I chose the virtual machine route.
By the way, if you want Unix shell commands on your Windows box, then use Cygwin! It provides the common Unix commands. Just download the most recent version from www.cygwin.com.
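To make the "ls versus dir" point concrete, here is a made-up sample session showing the handful of commands you will use constantly, with the rough DOS equivalents in the comments. It behaves the same way in Cygwin on Windows as on the Hadoop virtual machine.

```shell
# A short practice session (file names are made up).
mkdir demo                 # like "md": create a directory
cd demo
echo "hello" > test.txt    # create a small file
ls -l                      # like "dir": list files with details
cat test.txt               # like "type": show a file's contents
rm test.txt                # like "del": delete the file
cd ..
rmdir demo                 # like "rd": remove the (now empty) directory
```

Ten minutes of practice like this saves a lot of fumbling once you are logged into the virtual machine.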
(3) Use Yahoo!'s Hadoop Tutorial (which is here)
This has proven invaluable in my efforts, although I have a few corrections and clarifications, which I describe below.
(4) Be Prepared to Load Software! (And have lots of disk space)
I have loaded the following software packages on my computer, to make this work:
- VMWare Player 3.0 (from VMware). This is free software that lets any computer run a virtual machine with another operating system inside it.
- Hadoop VM Appliance (from Yahoo). This is the virtual machine image itself, preconfigured to run Hadoop.
- Eclipse SDE, version 3.3.2 (from the Eclipse project archives). Note that the Tutorial suggests version 3.3.1. Version 3.3.2 works. Anything more recent is called "Ganymede", and it just doesn't work with Hadoop.
- Java Development Kit from Sun.
(5) Getting VMWare to Share Folders between the host and virtual machine
In order to share folders (the easiest way to pass data back and forth), you need to install VMWare Tools. VMWare has instructions here. However, it took me a while to figure out what to do. The real steps are:
(a) Under VM-->Settings go to the Hardware tab. Set the CD/DVD to be "connected" and use the ISO image file. My path is "C:\Program Files (x86)\VMware\VMware Player\linux.iso", but I believe that it appeared automatically, after a pause. This makes the virtual machine think it is reading from a CD when it is really reading from this file.
(b) Run the commands to mount and extract the files
Login or su to root, then mount the CD image and extract the tools. (The tar file name may vary with the VMware version; list the mounted directory to see what is actually there.)
mount /dev/cdrom /cdrom
tar zxf /cdrom/vmware-freebsd-tools.tar.gz
(c) Do the installation
Run the installer script from the extracted directory (it is typically called vmware-install.pl). I accepted all the defaults when it asked a question.
(d) Set up a shared folder
Go to VM --> Settings... and go to the Options tab. Enable the shared folders. Choose an appropriate folder on the host machine (I chose "c:\temp") and give it a name on the virtual machine ("temp").
(e) The folder is available on the virtual machine at /mnt/hgfs/temp.
(6) Finding the Hadoop Plug-In for Eclipse
The third part of the Tutorial has these curious instructions: "In the hadoop-0.18.0/contrib/eclipse-plugin directory on this CD, you will find a file named hadoop-0.18.0-eclipse-plugin.jar. Copy this into the plugins/ subdirectory of wherever you unzipped Eclipse." They are curious because there is no CD.
The directory is actually on the virtual machine. Go to it and copy the jar file to the shared folder. Then, on the host machine, copy the file from the shared folder to the described place.
(7) The code examples may not work in the Tutorial
I believe the following driver code in the Tutorial is not correct (there are no FileInputPath or FileOutputPath classes):
FileInputPath.addInputPath(conf, new Path("input"));
FileOutputPath.addOutputPath(conf, new Path("output"));
The following works:
FileInputFormat.addInputPath(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
(8) Thinking the Hadoop Way
Hadoop is a different style of programming, because it is distributed. Actually, there are three different "machines" that it uses.
The first is your host machine, which is where you develop code. Although you can develop software on the Hadoop machine, the tools are better on your desktop.
The second is the Hadoop machine, which is where you issue the commands related to Hadoop. In particular, the "hadoop" command provides access to the distributed data and launches jobs. This machine has a shared drive with the host machine, which you can use to move files back and forth.
The data used by the programs, though, lives in yet another place, the Hadoop Distributed File System (HDFS). To move data between HDFS and the virtual machine, use the "hadoop fs" command (for example, "hadoop fs -put" copies a file into HDFS and "hadoop fs -get" copies it back out). From there, the shared folder lets you move files to the host machine or anywhere else.