Saturday, December 5, 2009

Hadoop and MapReduce: What Country is an IP Address in?

I have started using Hadoop to sessionize web log data. It has surprised me that there is not more written on this subject on the web, since I thought this was one of the more prevalent uses of Hadoop. Because I'm doing this work for a client, using Amazon EC2, I do not have sample web log data files to share.

One of the things that I want to do in the sessionization code is to include what country the user is in. Typically, the only source of location information in such logs is the IP address used for connecting to the internet. How can I look up the country the IP address is in?

This posting describes three things: the source of the IP geography information, new things that I'm learning about Java, and how to do the lookup in Hadoop.


The Source of IP Geolocation Information

MaxMind is a company that has a specialty in geolocation data. I have no connection to MaxMind, other than a recommendation to use their software from someone at the client where I have been doing this work. There may be other companies with similar products.

One way they make money is by offering a product called GeoIP Country, which has very, very accurate information about the country where an IP address is located (they also offer more detailed geographies, such as regions, states, and cities, but country is sufficient for my purposes). Their claim is that GeoIP Country is 99.8% accurate.

Although quite reasonably priced, I am content to settle for the free version, called GeoLite Country, for which the claim is 99.5% accuracy.

These products come in two parts. The first part is an interface, which is available for many languages, with the Java version here. I assume the most recent version is the best, although I happen to be using an older version.

Both the free and paid versions use the same interface, which is highly convenient, in case I want to switch between them. The difference is the database, which is available from this download page. The paid version has more complete coverage and is updated more frequently.

The interface consists of two important components:
  • Creating a LookupService object, which is instantiated with an argument that names the database file.
  • Using LookupService.getCountry() to do the lookup.
Simple enough interface; how do we get it to work in Java, and in particular, in Java for Hadoop?
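
Outside of Hadoop, basic usage looks something like the sketch below. (This is my own minimal example, not MaxMind's sample code; it assumes GeoIP.dat sits in the current directory.)

import com.maxmind.geoip.LookupService;

public class CountryLookupDemo {
    public static void main(String[] args) throws Exception {
        // Open the database; GEOIP_MEMORY_CACHE loads it into memory up front.
        LookupService lookup = new LookupService("GeoIP.dat",
                                                 LookupService.GEOIP_MEMORY_CACHE);
        // getCountry() takes a dotted-quad IP address string.
        System.out.println(lookup.getCountry("8.8.8.8").getCode());
        lookup.close();
    }
}

The answer comes back as a two-letter country code, which is all I need.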


New Things I've Learned About Java

As I mentioned a few weeks ago in my first post on learning Hadoop, I had never used Java prior to this endeavor (although I am familiar with other object-oriented programming languages such as C++ and C#). I have been learning Java on an "as needed" basis, which is perhaps not the most efficient way overall, but it has been the fastest way to get started.

When programming in Java, there are two steps. I am using the javac command to compile code into class files. Then I'm using the jar command to package the class files into a jar file. I have been considering this the equivalent of "compiling and linking code," which also takes two steps.
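
For instance (the file names here are made up; the Hadoop jar path is the one on my system), the two steps look like:

javac -cp .:/opt/hadoop/hadoop-0.20.1-core.jar Sessionize.java
jar cf sessionize.jar Sessionize*.class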

However, the jar file is much more versatile than a regular executable image. In particular, I can put arbitrary files in it. These files are then available in my application, although Java calls them "resources" instead of "files." This will be very important in getting MaxMind's software to work with Hadoop: I can include the IP database in my application jar file, which is pretty cool.

There is a little complexity, though, which involves the paths where resources are located. When using Hadoop, I have been using statements such as "import org.apache.hadoop.mapreduce.*;" without really understanding them. This statement brings in the classes associated with the mapreduce package, because three things have happened (there is an illustration after this list):
  • The original work (at Apache) was done in a directory structure that included ./org/apache/hadoop/mapreduce.
  • The jar file was created in that (higher-level) root directory. Note that the package directory could be buried deep down in the directory hierarchy; everything is relative to the directory where the jar file is created.
  • I am including that jar file explicitly in my javac command, using the -cp argument, which specifies a class path.
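
Concretely, here is the correspondence (an illustration, not real code from my project):

// A file at this location inside the jar (or under a class-path directory):
//     org/apache/hadoop/mapreduce/Mapper.class
// is what the package-qualified name in an import statement refers to:
import org.apache.hadoop.mapreduce.Mapper;
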
All of this worked without my having to understand it, because I had some examples of working code. The MaxMind code, though, poses a new problem: this is the first time that I have had to get someone else's code to work. How do we do this?

First, after you uncompress their Java code, copy the com directory to the place where you build your jar file. (Alternatively, you could just link the directories; or, if you know what you are doing, you may have another solution entirely.)

Next, for compiling the files, I modified the javac command line so it read: javac -cp .:/opt/hadoop/hadoop-0.20.1-core.jar:com/maxmind/geoip [subdirectory]/*.java. That is, I extended the class path so javac can find the MaxMind class files. Note that javac resolves classes by their full package name, so a class-path entry must point at the package root; here it is the "." entry, which contains the com directory, that does the real work.

The class path can accept either a jar file or a directory. When the entry is a jar file, javac looks for classes inside the jar. When it is a directory, javac looks for classes beneath that directory, following the package structure (so a class in package com.maxmind.geoip must live in com/maxmind/geoip under the entry). That is simple enough. I do have to admit, though, that it wasn't obvious when I started. I don't think of jar files and directories as being equivalent. But they are: both are just roots of a package tree.

Once the code compiles, just be sure to include the com/maxmind/geoip/* files in the jar command. In addition, I also copied over the GeoLite Country database and included it in the jar file. Do note that the path you hand to the jar command is the entry name stored inside the jar, and that makes a difference! So adding the database as ~/maxmind/GeoIP.dat stores it under that whole path, while adding it as ./GeoIP.dat from the build directory stores it at the root of the jar, and the two behave differently when we want to read the data file back.
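
As a sanity check (the jar name here is mine), listing the archive shows the entry names, which become the resource paths used later:

cd ~/build                # the directory containing the com tree and GeoIP.dat
jar cf sessionize.jar com GeoIP.dat *.class
jar tf sessionize.jar     # lists com/maxmind/geoip/LookupService.class, GeoIP.dat, ...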


Getting MaxMind to Work With Hadoop

Things are a little complicated in the Hadoop world, because we need to pass in a database file to initialize the MaxMind classes. My first attempt was to initialize the lookup service in the Map class using code like:

iplookup = new LookupService("~/maxmind/GeoIP.dat",
                             LookupService.GEOIP_MEMORY_CACHE |
                             LookupService.GEOIP_CHECK_CACHE);


This looked right to me and was similar to code that I found in various places on the internet.

Guess what? It didn't work. And it didn't work for a fundamentally important reason: Map classes run on the distributed nodes, and the distributed nodes do not have access to the local file system of the machine where I launch the job. Duh, this is why HDFS (the Hadoop Distributed File System) was invented!

But now I have a problem. There is a reasonably sized data file -- about 1 Mbyte. Copying it to HDFS does not really solve my problem, because it is not an "input" into the Map routine. I suppose I could copy it and then figure out how to open it as a sequence file, but that is not the route I took.

Up to this point, I had found three ways to get information into the Map classes:
  1. Compile it in using constants.
  2. Pass small amounts of data in the Conf structure, using its various set and get functions (see the sketch after this list). I have examples of this in the row number code.
  3. Use the distributed cache. I haven't done this yet, because there are warnings about setting it up correctly using configuration xml files. Wow, that is something that I can easily get wrong. I'll learn this when I think it is absolutely necessary, knowing that it might take a few hours to get it right.
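
Here is a sketch of the second way, since it comes up so often (the key name is made up):

// In the driver, before submitting the job:
conf.set("myapp.start.date", "2009-12-01");

// In the Mapper or Reducer, retrieved via the context:
String startDate = context.getConfiguration().get("myapp.start.date");
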
But now I've discovered that Java has an amazing fourth way: I can pass files in through the jar file itself. Remember, when we use Hadoop, we call a function setJarByClass(). Well, this function takes the class that is passed in and ships the entire jar file containing that class to each of the distributed nodes (for both the Map and Reduce classes). Now, if that jar file just happens to contain a data file with IP-address-to-country lookup data, then Java has conspired to send my database file exactly where it is needed!
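
In the driver, that call is a one-liner (a sketch; the class names are mine):

// setJarByClass() tells Hadoop which jar to ship to the task nodes --
// and everything packaged inside that jar travels along with it.
Job job = new Job(conf, "sessionize");
job.setJarByClass(Sessionize.class);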

Thank you, Java! You solved this problem. (Or should I be thanking Hadoop?)

The only question is how to get the file out of the jar file. Well, the things in the jar file are called "resources." Resources are accessed using uniform resource locators (URLs), and the URL is conveniently built out of the file name. Life is not so convenient that the URL is the file name, but it is close enough: the URL prepends the file name with a scheme (in this case, "file:").

So, to get the data file out of the jar file (which we put in using the jar command), we need to:
  • figure out the name for the resource in the jar file;
  • convert the resource name to a file name; and then,
  • open this just as we would a regular file (by passing it into the constructor).
The code to do this is:

import com.maxmind.geoip.LookupService;
...
// In the Mapper (for example, in its setup method); note that the
// LookupService constructor throws IOException, which must be handled.
if (iplookup == null) {
    String filename = getClass().getResource("/GeoIP.dat").toExternalForm().substring(5);
    iplookup = new LookupService(filename,
                                 LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

The import tells the Java code where to find the LookupService class. To make this work, we have to include the appropriate directory in the class path, as described earlier.

The first statement creates the file name. The resource name "/GeoIP.dat" says that the resource is a file at the root of the jar, that is, in the directory where the jar file was created. The rest of the statement converts this to a file name. The function toExternalForm() renders the resource's URL as a string, which is the file name prepended with a scheme. Because Hadoop unpacks the job jar onto each node's local disk, the resource is a real file there, and the scheme is "file:", which is exactly five characters; substring(5) strips it off, leaving an ordinary path. The original example code I found had substring(6), which did not work for me on EC2 (presumably it was written for a URL with a six-character prefix).
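
Spelled out step by step (the unpacked path is illustrative):

// getClass().getResource("/GeoIP.dat") -> URL:    file:/local/task/dir/GeoIP.dat
// .toExternalForm()                    -> String: "file:/local/task/dir/GeoIP.dat"
// .substring(5)                        -> String: "/local/task/dir/GeoIP.dat"   ("file:" is five characters)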

The second statement passes this into the lookup service constructor.

Now the lookup service is available, and I can use it via this code:

this.ipcountry = iplookup.getCountry(sale.ip).getCode();
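
One defensive note (my addition; worth verifying against the version of the interface you download): I believe the legacy lookup returns a country code of "--" when it cannot resolve an address, rather than throwing an exception, so code like this can guard against unknowns:

String code = iplookup.getCountry(sale.ip).getCode();
// "--" is (I believe) the legacy API's marker for an unresolvable address.
if (!"--".equals(code)) {
    this.ipcountry = code;
}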

Voila! Starting from nothing but an IP address, I am able to use free code downloaded from the internet to look up the country, using the distributed power of Hadoop.
