Monday, December 28, 2009

Differential Response or Uplift Modeling

Some time before the holidays, we received the following inquiry from a reader:

Dear Data Miners,



I’ve read interesting arguments for uplift modeling (also called incremental response modeling) [1], but I’m not sure how to implement it. I have responses from a direct mailing with a treatment group and a control group. Now what? Without data mining, I can calculate the uplift between the two groups but not for individual responses. With the data mining techniques I know, I can identify the ‘do not disturbs,’ but there’s more than avoiding mailing that group. How is uplift modeling implemented in general, and how could it be done in R or Weka?



[1] http://www.stochasticsolutions.com/pdf/CrossSell.pdf

I first heard the term "uplift modeling" from Nick Radcliffe, then of Quadstone. I think he may have invented it. In our book, Data Mining Techniques, we use the term "differential response analysis." It turns out that "differential response" has a very specific meaning in the child welfare world, so perhaps we'll switch to "incremental response" or "uplift" in the next edition. But whatever it is called, you can approach this problem in a cell-based fashion without any special tools. Cell-based approaches divide customers into cells or segments in such a way that all members of a cell are similar to one another along some set of dimensions considered to be important for the particular application. You can then measure whatever you wish to optimize (order size, response rate, . . .) by cell and, going forward, treat the cells where treatment has the greatest effect.

Here, the quantity to measure is the difference in response rate or average order size between treated and untreated groups of otherwise similar customers. Within each cell, we need a randomly selected treatment group and a randomly selected control group; the incremental response or uplift is the difference in average order size (or whatever you are measuring) between the two. Of course some cells will have higher or lower overall average order size, but that is not the focus of incremental response modeling. The question is not "What is the average order size of women between 40 and 50 who have made more than 2 previous purchases and live in a neighborhood where average household income is two standard deviations above the regional average?" It is "What is the change in order size for this group when it receives the treatment?"
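
The per-cell arithmetic is simple. Here is a minimal sketch in java; the Customer class and its fields are hypothetical stand-ins for whatever the mailing file actually contains.

// Minimal sketch of the per-cell uplift calculation. The Customer class and
// its fields are hypothetical; the same arithmetic works for average order
// size (or any other measure) instead of response rate.
import java.util.List;

class UpliftCell {
    static double responseRate(List<Customer> group) {
        if (group.isEmpty()) return 0.0;
        int responders = 0;
        for (Customer c : group) {
            if (c.responded) responders++;
        }
        return (double) responders / group.size();
    }

    // Uplift for one cell: treated response rate minus control response rate.
    static double uplift(List<Customer> treated, List<Customer> control) {
        return responseRate(treated) - responseRate(control);
    }
}

class Customer {
    boolean responded;   // did this customer respond to the mailing?
}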

Ideally, of course, you should design the segmentation and assignment of customers to treatment and control groups before the test, but the reader who submitted the question has already done the direct mailing and tallied the responses. Is it now too late to analyze incremental response?  That depends: If the control group is a true random control group and if it is large enough that it can be partitioned into segments that are still large enough to provide statistically significant differences in order size, it is not too late. You could, for instance, compare the incremental response of male and female responders.

A cell-based approach is only useful if the segment definitions are such that incremental response really does vary across cells. Dividing customers into male and female segments won't help if men and women are equally responsive to the treatment. This is the advantage of the special-purpose uplift modeling software developed by Quadstone (now Portrait Software). This tool builds a decision tree where the splitting criterion is maximizing the difference in incremental response. This automatically leads to segments (the leaves of the tree) characterized by either high or low uplift.  That is a really cool idea, but the lack of such a tool is not a reason to avoid incremental response analysis.

Sunday, December 27, 2009

Hadoop and MapReduce: Characterizing Data

This posting describes using Hadoop and MapReduce to characterize data -- that is, to summarize the values in the columns of a data file in order to learn what each column contains.

It also explains why Hadoop is better suited than SQL for this particular problem.

The code discussed in this post is available in these files: GenericRecordMetadata.java, GenericRecord.java, GenericRecordInputFormat.java, and Characterize.java. This work builds on the classes introduced in my previous post Hadoop and MapReduce: Method for Reading and Writing General Record Structures (the versions here fix some bugs in the earlier versions).

What Does This Code Do?

The purpose of this code is to provide summaries for data in a data file. Being Hadoop, the data is stored in a delimited text format, with one record per line, and the code uses GenericRecord to handle the specific data. The generic record classes are things that I wrote to handle this situation; the Apache java libraries apparently have other approaches to solving this problem.

The specific summaries for each column are:
  • Number of records.
  • Number of values.
  • Minimum and maximum values for string variables, along with the number of times the minimum and maximum values appear in the data.
  • Minimum and maximum lengths for string variables, along with the number of times these appear and an example of the value.
  • First, second, and third most common string values.
  • Number of times the column appears to be an integer.
  • Minimum and maximum values when treating the values as integers, along with the number of times that these appear.
  • Number of times the column appears to contain a real number.
  • Minimum and maximum values when treating the values as doubles, along with the number of times that these appear.
  • Count of negative, zero, and positive values.
  • Average value.
These summaries are arbitrary. The code should be readily extensible to other types and other summaries.

My ultimate intention is to use this code to easily characterize input and result files that I create in the process of writing Hadoop code.


Overview of the Code

The characterization problem is solved in two passes of MapReduce. The first pass creates a histogram of all the values in all the columns; the second summarizes that histogram for each column.

The histogram step takes files with the following format:
  • Key: undetermined
  • Values: text values separated by a delimiter (by default, a tab)
(This is the GenericRecord format.)
The Map phase produces a file of the format:
  • Key: column name and column value, separated by a colon
  • Value: "1"
Combine and Reduce then add up the "1"s, producing a file of the format:
  • Key: column name
  • Value: column value and count, separated by a tab
Using a tab as a separator is a convenience, because this is also the default separator for the key.
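
To make the data flow concrete, here is a minimal sketch of this histogram pass. This is not the actual Characterize.java code: the class names are hypothetical, and the column names are assumed to arrive through a made-up configuration property rather than through the GenericRecord metadata that the real code uses.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of the histogram pass described above.
class HistogramMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private Text outKey = new Text();
    private String[] columnNames;

    @Override
    protected void setup(Context context) {
        // In the real code the column names come from the GenericRecord metadata;
        // here they are assumed to arrive as a made-up configuration property.
        columnNames = context.getConfiguration().get("characterize.columns", "").split("\t");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] values = value.toString().split("\t", -1);
        for (int i = 0; i < values.length && i < columnNames.length; i++) {
            outKey.set(columnNames[i] + ":" + values[i]);  // key = column name and value
            context.write(outKey, ONE);                    // value = "1"
        }
    }
}

// The reducer adds up the counts and writes: key = column name,
// value = column value <tab> count, matching the format described above.
class HistogramReducer extends Reducer<Text, LongWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        String[] parts = key.toString().split(":", 2);  // split "column:value"
        context.write(new Text(parts[0]), new Text(parts[1] + "\t" + sum));
    }
}

A combiner that simply sums the counts (keeping the map output types, Text and LongWritable) can do the partial aggregation before this reducer reformats the key.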

The second phase of the Map/Reduce job takes the previous output and uses the reduce function to summarize all the different values in the histogram. This code is quite specific to the particular summaries. The GenericRecord format is quite useful because I can simply add new summaries in the code, without worrying about the layout of the records.

The code makes use of exception processing to handle particular data types. For instance, the following code block handles the integer summaries:

try {
....// Long.parseLong() throws NumberFormatException when the value is not an integer
....long intval = Long.parseLong(valstr);
....hasinteger = true;
....intvaluecount++;
....intrecordcount += Long.parseLong(val.get("count"));
}
catch (NumberFormatException e) {
....// not an integer, so there is nothing to do here
}

This block tries to convert the value to an integer (actually to a long). When this works, the code updates the various variables that characterize integer values. When it fails, the code simply continues.

There is a similar block for real numbers, and I could imagine adding more such blocks for other formats, such as dates and times.
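
The analogous block for real numbers presumably looks something like this (a sketch; the variable names here mirror the integer block and may not match the actual code):

try {
....double dblval = Double.parseDouble(valstr);
....hasdouble = true;
....dblvaluecount++;
....dblrecordcount += Long.parseLong(val.get("count"));
}
catch (NumberFormatException e) {
....// not a real number; nothing to do here
}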

Why MapReduce Is Better Than SQL For This Task

Characterizing data is the process of summarizing data along each column, to get an idea of what is in the data. Normally, I think about data processing in terms of SQL (after all, my most recent book is Data Analysis Using SQL and Excel). SQL, however, is particularly poor for this purpose.

First, SQL has precious few functions for this task -- basically MIN(), MAX(), AVG() and judicious use of the CASE statement. Second, SQL generally has lousy support for string functions and inconsistent definitions for date and time functions across different databases.

Worse, though, is that traditional SQL can only summarize one column at a time. The traditional SQL approach would be to summarize each column individually in a query and then connect them using UNION ALL statements. The result is that the database has to do a full-table scan for each column.

Although not supported in all databases, standard SQL now includes the GROUPING SETS keyword, which can potentially alleviate this problem. However, GROUPING SETS is messy, since each key ends up in its own column. That is, I want the results in the format "column name, column value". With GROUPING SETS, I get "column1, column2 ... columnN", with NULLs for all unused columns, except for the one with a value.

The final problem with SQL occurs when the data starts out in text files. Much of the problem of characterizing and understanding the data happens outside the database during the load process.

Tuesday, December 22, 2009

Interview with Eric Siegel

This is the first of what may become an occasional series of interviews with people in the data mining field. Eric Siegel is the organizer of the popular  Predictive Analytics World conference series. I asked him a little bit about himself and gave him a chance to plug his conference.  A propos, readers of this blog can get a 15% discount on a two-day conference pass by pasting the code DATAMINER010 into the Promotional Code box on the conference registration page.

Q: Not many kids (one of mine is perhaps the exception that proves the rule) have the thought "when I grow up, I want to be a data miner!"  How did you fall into this line of work?

To many laypeople, the word "data" sounds dry, arcane, meaningless - boring! And number-crunching on it doubly so. But this is actually the whole point. Data is the uninterpreted mass of things that've happened.  Extracting what's up, the means behind the madness, and in so doing modeling and learning about human behavior... well, I feel nothing in science or engineering is more interesting.
In my "previous life" as an academic researcher, I focused on core predictive modeling methods. The ability for a computer to automatically learn from experience (data really is recorded experience, after all), is the best thing since sliced bread. Ever since I realized, as I grew up from childhood, that space travel would in fact be a tremendous, grueling pain in the neck (not fun like "Star Wars"), nothing in science has ever seemed nearly as exciting.


In my current 9-year career as a commercial practitioner, I've found that indeed the ability to analytically "learn" and apply what's been learned turns out to provide plenty of business value, as I imagined back in the lab.  Research science is fun in that you have the luxury of abstraction and are often fairly removed from the need to prove near-term industrial applicability. Applied science is fun for the opposite reason: the tangle of challenges, although some are less abstract and in that sense more mundane, is the only thing between you and getting the great ideas of the world to actually work, come to fruition, and deliver an irrefutable impact.


Q: Most conferences happen once a year.  Why does PAW come around so much more frequently?

In fact, many commercial conferences focused on the industrial deployment of technology occur multiple times per year, in contrast to research conferences, which usually take place annually.  There's an increasing demand for a more frequent commercial event as predictive analytics continues to "cross chasms" towards more wide-scale penetration. There's just too much to cover - too many brand-name case studies and too many hot topics - to wait a year before each event.


Q: You use the phrase "predictive analytics" for what I've always called "data mining." Do the terms mean something different, or is it just that fashions change with the times?


"Data mining" is indeed often used synonymously with "predictive analytics", but not always. Data mining's definitions usually entail the discovery of non-trivial, useful patterns/knowledge/insights from data -- if you "dig" enough, you get a "nugget." This is a fairly abstract definition and therefore envelops a wide range of analytical techniques. On the other hand, predictive analytics is basically the commercial deployment of predictive modeling specifically (that is, in academic jargon, supervised learning, i.e., optimizing a statistical model over labeled/historical cases). In business applications, this basically translates to a model that produces a score for each customer, prospect, or other unit of interest (business/outlet location, SKU, etc), which is roughly the working definition we posted on the Predictive Analytics World website. This would seem to potentially exclude related data mining methods such as forecasting, association mining and clustering (unsupervised learning), but, naturally, we include some sessions at the conference on these topics as well, such as your extremely well-received session on forecasting in October 2009 in DC.



Q: How do you split your time between conference organizing and analytical consulting work?  (That's my polite way of trying to rephrase a question I was once asked: "What's the split between spewing and doing?")

When one starts spewing a lot, there becomes much less time for doing. In the last 2 years, as my 2-day seminar on predictive analytics has become more frequent (both as public and customized on-site training sessions - see http://www.businessprediction.com), and I helped launch Predictive Analytics World, my work in services has become less than half my time, and I now spend very little time doing hands-on, playing a more advisory and supervisory role for clients, alongside other senior consultants who do more hands-on for Prediction Impact services engagements.


Q: I can't help noticing that you have a Ph.D.  As someone without any advanced degrees, I'm pretty good at rationalizing away their importance, but I want to give you a chance to explain what competitive advantage it gives you.

The doctorate is a research-oriented degree, and the Ph.D. dissertation is in a sense a "hazing" process. However, it's become clear to me that the degree is very much net positive for my commercial career. People know it entails a certain degree of discipline and aptitude. And, even if I'm not conducting academic research most of the time, every time one applies analytics there is an experimental component to the task. On the other hand, many of the best data miners - the "rock star" consultants such as yourself - did not need a doctorate program in order to become great at data mining.



Q: Moving away from the personal, how do you think the move of data and computing power into the cloud is going to change data mining?

I'd say there's a lot of potential in making parallelized deployment more readily available to any and all data miners.  But, of all the hot topics in analytics, I feel this is the one into which I have the least visibility. It does, after all, pertain more to infrastructure and support than to the content, meaning and insights gained from analysis.

But, turning to the relevant experts, be sure to check out Feb PAW's upcoming session, "In-database Vs. In-cloud Analytics: Implications for Deployment" - see http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day2-7


Q: Can you give examples of problems that once seemed like hot analytical challenges that have now become commoditized?

Great question. Hmm... common core analytical methods such as decision trees and logistic regression may be the only true commodities to date in our field. What do you think?

Q: There are some tasks that we used to get hired for 10 or 15 years ago that no one comes to us for these days. Direct mail response models are an example. I think people feel like they know how to do those themselves. Or maybe that is something the data vendors pretty much give away with the data.

Which of today's hot topics in data mining do you see as ripe for commoditization?

UPLIFT (incremental lift) modeling is branching out, with applications going beyond response and churn modeling (see http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day2-2).

Expanding traditional data sets with SOCIAL DATA is continuing to gain traction across a growing range of verticals as analytics practitioners find great value (read: tremendous increases in model lift) leveraging the simple fact that people behave similarly to those to whom they're socially connected. Just as the healthcare industry has discovered that quitting smoking is "contagious" and that the risk of obesity dramatically increases if you have an obese friend, telecommunications, online social networks and other industries find that "birds of a feather" churn and even commit fraud "together". Is this more because people influence one another, or because they befriend others more like themselves?  Either way, social connections are hugely predictive of the customer behaviors that matter to business.



Q: There have been several articles in the popular press recently, like this one in the NY Times,  saying that statistics and data mining are the hottest fields a young person could enter right now.  Do you agree?

Well, for the subjective reasons in my answer to your first question above, I would heartily agree. If I recall, that NY Times article focused on the demand for data miners as the career's central appeal. Indeed, it is a very marketable skill these days, which certainly doesn't hurt.

Friday, December 18, 2009

Hadoop and MapReduce: Method for Reading and Writing General Record Structures

I'm finally getting more comfortable with Hadoop and java, and I've decided to write a program that will characterize data files in parallel.

To be honest, I find that I am spending a lot of time writing new Writable and InputFormat classes, every time I want to do something. Every time I introduce a new data structure used by the Hadoop framework, I have to define two classes. Yucch!

So, I put together a simple class called GenericRecord that can store a set of column names (as strings) and a corresponding set of column values (as strings). These are stored in delimited files, and the various classes understand how to parse these files. In particular, the code can read any tab-delimited file that has column names on the first row (and changing the delimiter should be easy). One nice aspect is the ability to use the GenericRecord as the output of a reduce function, which means that the number and names of the output columns can be specified in the code -- rather than in additional files with additional classes.

I wouldn't be surprised if similar code already exists with more functionality than the code I have here. This effort is also about my learning Hadoop.

This posting provides the code and explains the important features of how it works. The code is available in these files: GenericRecord.java, GenericRecordMetadata.java, GenericRecordInputFormat.java, and GenericRecordTester.java.

What This Code Does

This code is analogous to the word count code, which must be familiar to anyone starting to learn MapReduce (since it seems to be the first example in all the documentation I've seen). Instead of counting words, this code counts the occurrence of values in the columns.

The code reads input files and produces output records with three columns:
  • A column name in the original data.
  • A value in the column.
  • The number of times the value appears.
Do note that for data with many unique values in many columns, the number of output records is likely to far exceed the number of input records. So, the output file can be bigger than the input file.

The input records are assumed to be in a text file with one record per row. The first row contains the names of the columns, delimited by a tab (although this could easily be changed to another delimiter). The rest of the rows contain values. Note that this assumes that the input files are all read from the beginning; that is, that a single input file is not split among multiple map tasks.

One irony of this code and the Hadoop framework is that the input files do not have to be in the same format. So, I could upload a bunch of different files, with different numbers of columns, and different column names, and run them all in parallel. I would have to be careful that the column names are all different, for this to work well.

Examples of such files are available on the companion page for my book Data Analysis Using SQL and Excel. These are small files by the standards of Hadoop (measured in megabytes) but quite sufficient for testing and demonstrating code.


Overview of Approach

There are four classes defined for this code:
  • GenericRecordMetadata stores the metadata (column names) for a record.
  • GenericRecord stores the values for a particular record.
  • GenericRecordInputFormat provides the interface for reading the data into Hadoop.
  • GenericRecordTester provides the functions for the MapReduce framework.
The metadata consists of the names of the columns, which can be accessed either by a column index or by a column name. The metadata has functions to translate a column name into a column index. Because it uses a HashMap, the functions should run quite fast, although they are not optimal in memory space. This is okay, because the metadata is stored only once, rather than once per row.

The generic record itself stores the data as an array of strings. It also contains a pointer to the metadata object, in order to fetch the names. The array of strings minimizes both memory overhead and time, but does require access using an integer. The other two classes are needed for the Hadoop framework.
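
To make this concrete, here is a stripped-down sketch of these two classes. The names and details are simplified for illustration; the real GenericRecordMetadata.java and GenericRecord.java linked above have more functionality (parsing, error handling, and Writable support).

import java.util.HashMap;

// Simplified sketch of the metadata: column names plus a fast name-to-index map.
class GenericRecordMetadataSketch {
    private String[] columnNames;
    private HashMap<String, Integer> nameToIndex = new HashMap<String, Integer>();

    GenericRecordMetadataSketch(String[] columnNames) {
        this.columnNames = columnNames;
        for (int i = 0; i < columnNames.length; i++) {
            nameToIndex.put(columnNames[i], i);   // HashMap makes name lookups fast
        }
    }

    int getIndex(String name) { return nameToIndex.get(name); }
    String getName(int index) { return columnNames[index]; }
    int getNumColumns() { return columnNames.length; }
}

// Simplified sketch of the record: an array of string values plus a pointer
// to the shared metadata object (stored once, not once per row).
class GenericRecordSketch {
    private GenericRecordMetadataSketch metadata;
    private String[] values;

    GenericRecordSketch(GenericRecordMetadataSketch metadata, String[] values) {
        this.metadata = metadata;
        this.values = values;
    }

    String get(int index) { return values[index]; }
    String get(String name) { return values[metadata.getIndex(name)]; }
}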

One small challenge is getting this to work without repeating the metadata information for each row of data. This is handled by including the column names as the first row in any file created by the Hadoop framework, and not by putting the column names in the output for each row.


Setting Up The Metadata When Reading

The class GenericRecordInputFormat basically does all of its work in a private class called GenericRecordRecordReader. This class has two important functions: initialize() and nextKeyValue().

The initialize() function sets up the metadata, either by reading environment variables in the context object or by parsing the first line of the input file (depending on whether or not the environment variable genericrecord.numcolumns is defined). I haven't tested passing in the metadata using environment variables, because setting up the environment variables poses a challenge. These variables have to be set in the master routine in the configuration before the map function is called.

The nextKeyValue() function reads a line of the text file, parses it using the function split(), and sets the values in the line. The verification on the number of items read matching the number of expected items is handled in the function lineValue.set(), which raises an exception (currently unhandled) when there is a mismatch.
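
For illustration, the validation in set() might look roughly like the following method, added to the GenericRecordSketch class above (the real GenericRecord.set() may differ in details):

    public void set(String[] items) {
        if (items.length != metadata.getNumColumns()) {
            // this is the exception that is currently left unhandled upstream
            throw new IllegalArgumentException("expected " + metadata.getNumColumns()
                + " values but found " + items.length);
        }
        this.values = items;
    }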


Setting Up The Metadata When Writing

Perhaps more interesting is the ability to set up the metadata dynamically when writing. This is handled mostly in the setup() function of the SplitReduce class, which sets up the metadata using various function calls.

Writing the column names out at the beginning of the results file uses a couple of tricks. First, this does not happen in the setup() function but rather in the reduce() function itself, for the simple reason that the latter handles IOException.

The second trick is that the metadata is written out by putting it into the values of a GenericRecord. This works because the values are all strings, and the record itself does not care if these are actually for the column names.

The third trick is to be very careful with the function GenericRecord.toString(). Each column is separated by a tab character, because the tab is used to separate the key from the value in the Hadoop framework. In the reduce output files, the key appears first (the name of the column in the original data), followed by a tab -- as put there by the Hadoop framework. Then, toString() adds the values separated by tabs. The result is a tab-delimited file that looks like column names and values, although the particular pieces are put there through different mechanisms. I imagine that there is a way to tell Hadoop to use a different character to separate the key and value, but I haven't researched this point.

The final trick is to be careful about the ordering of the columns. The code iterates through the GenericRecord's table of values manually, using an index rather than a for-in loop. This is quite intentional, because it allows the code to control the order in which the columns appear -- which is presumably the original order in which they were defined. Using a for-in loop is also perfectly valid, but the columns may appear in a different order (which is still fine, because the column names also appear in that same order).
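
Continuing the sketch, a toString() along these lines (again simplified relative to the real code) produces the tab-delimited output described above:

    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {   // manual index loop preserves the column order
            if (i > 0) {
                sb.append('\t');                    // tab matches Hadoop's key/value separator
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }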

The result of all this machinery is that the reduce function can now return values in a GenericRecord. And, I can specify these in the reduce function itself, without having to mess around with other classes. This is likely to be a big benefit as I attempt to develop more code using Hadoop.

Thursday, December 17, 2009

What do group members have in common?

We received the following question via email.

Hello,

I have a data set which has both numeric and string attributes. It is a data set of our customers doing a particular activity (eg: customers getting one particular loan). We need to find out the pattern in the data or the set of attributes which are very common for all of them.

Classification/regression not possible, because there is only one class
Association rule cannot take my numeric value into consideration
clustering clusters similar people, but not common attributes.


 What is the best method to do this? Any suggestion is greatly appreciated.

The question "what do all the customers with a particular type of loan have in common"  sounds seductively reasonable. In fact, however, the question is not useful at all because the answer is "Almost everything."  The proper question is "What, if anything, do these customers have in common with one another, but not with other people?"  Because people are all pretty much the same, it is the tiny ways they differ that arouse interest and even passion.  Think of two groups of Irishmen, one Catholic and one Protestant. Or two groups of Indians, one Hindu and one Muslim. If you started with members of only one group and started listing things they had in common, you would be unlikely to come up with anything that didn't apply equally to the other group as well.

So, what you really have is a classification task after all.  Take the folks who have the loan in question and an equal number of otherwise similar customers who do not. Since you say you have a mix of numeric and string attributes, I would suggest using decision trees. These can split equally well on numeric values ( x>n ) or categorical variables ( model in ('A','B','C') ). If the attributes you have are, in fact, able to distinguish the two groups, you can use the rules that describe leaves that are high in holders of product A as "what holders of product A have in common" but that is really shorthand for "what differentiates holders of product A from the rest of the world."
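
For example, here is a minimal sketch of this approach using Weka's J48 decision tree. The file name and the position of the class attribute (a has_loan flag as the last column) are assumptions for illustration, and the categorical (string) attributes are assumed to be declared as nominal attributes in the ARFF file.

// Minimal sketch: build a decision tree that separates loan holders from
// otherwise similar non-holders. The file name and class attribute position
// are assumptions; adjust them to the actual data set.
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoanHolderTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers_with_and_without_loan.arff");
        data.setClassIndex(data.numAttributes() - 1);  // assume has_loan is the last attribute

        J48 tree = new J48();          // handles numeric and categorical attributes
        tree.buildClassifier(data);
        System.out.println(tree);      // the printed rules describe what differentiates the groups
    }
}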

Tuesday, December 15, 2009

Hadoop 0.20: Creating Types

In various earlier posts, I wrote code to read and write zip code data (which happens to be part of the companion page to my book Data Analysis Using SQL and Excel). This provides sample data for use in my learning Hadoop and mapreduce.

Originally, I wrote the code using Hadoop 0.18, because I was using the Yahoo virtual machine. I have since switched to the Cloudera virtual machine, which runs the most recent version of Hadoop, V0.20.

I thought switching my code would be easy. The issue is less the difficulty of the switch than some nuances in Hadoop and java. This post explains some of the differences between the two versions, when adding a new type into the system. I explained my experience with the map, reduce, and job interface in another post.

The structure of the code is simple. One java file implements a class called ZipCensus, which implements the Writable interface (which I include using import org.apache.hadoop.io.*). Another class, ZipCensusInputFormat, implements the input format so ZipCensus can be used as input and output in MapReduce functions. The input format class uses another, private class called ZipCensusRecordReader, which does all the work. Because of the rules of java, these need to be in two different files, each with the same name as the public class it contains. The files are available in ZipCensus.java and ZipCensusInputFormat.java.

These files now use the Apache mapreduce interface rather than the mapred interface, so I must import the right packages into the java code:

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.InputSplit;


And then I had a problem when defining the ZipCodeInputFormat class using the code:

public class ZipCensusInputFormat extends FileInputFormat {
....public RecordReader createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
........return new ZipCensusRecordReader();
....} // RecordReader
} // class ZipCensusInputFormat


The specific error given by Eclipse/Ganymede is: "The type org.apache.commons.logging.Log cannot be resolved. It is indirectly referenced from required .class files." This is a bug in Eclipse/Ganymede, because the code compiles and runs using javac/jar. At one point, I fixed this by including various Apache commons jars. However, since I didn't need them when compiling manually, I removed them from the Eclipse project.

The interface for the RecordReader class itself has changed. The definition for the class now looks like:

class ZipCensusRecordReader extends RecordReader

Previously, this used the syntax "implements" rather than "extends". For those familiar with java, this is the difference between an interface and an abstract class, a nuance I don't yet fully appreciate.

The new interface (no pun intended) includes two new functions, initialize() and close(). I like this change, because it parallels the setup() and cleanup() convention used for the map and reduce classes.

As a result, I changed the constructor to take no arguments. This has moved to initialize(), which takes two arguments of type InputSplit and TaskAttemptContext. The purpose of this code is simply to skip the first line of the data file, which contains column names.

The most important function of the class is now called nextKeyValue() rather than next(). The new function takes no arguments, putting the results in local private variables accessed using getCurrentKey() and getCurrentValue(). The old next() took two arguments, one for the key and one for the value, although the results could be accessed using the same two functions.
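
To summarize the new shape, here is a stripped-down sketch of a 0.20-style record reader. The key and value types, and the delegation to LineRecordReader, are my assumptions for illustration, not the actual ZipCensusRecordReader code.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch only: the key type (Text) and the use of LineRecordReader are assumptions.
class SketchRecordReader extends RecordReader<Text, ZipCensus> {
    private LineRecordReader lineReader = new LineRecordReader();  // does the actual file reading
    private Text currentKey = new Text();
    private ZipCensus currentValue = new ZipCensus();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
        lineReader.nextKeyValue();  // skip the first line, which contains column names
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!lineReader.nextKeyValue()) {
            return false;  // no more records
        }
        // parse lineReader.getCurrentValue() into currentKey and currentValue here
        return true;
    }

    @Override
    public Text getCurrentKey() { return currentKey; }

    @Override
    public ZipCensus getCurrentValue() { return currentValue; }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}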

Overall the changes are simple modifications to the interface, but they can be tricky for the new user. I did not find a simple explanation for the changes anywhere on the web; perhaps this posting will help someone else.

Saturday, December 5, 2009

Hadoop and MapReduce: What Country is an IP Address in?

I have started using Hadoop to sessionize web log data. It has surprised me that there is not more written on this subject on the web, since I thought this was one of the more prevalent uses of Hadoop. Because I'm doing this work for a client, using Amazon EC2, I do not have sample web log data files to share.

One of the things that I want to do in the sessionization code is to include what country the user is in. Typically, the only source of location information in such logs is the IP address used for connecting to the internet. How can I look up the country the IP address is in?

This posting describes three things: the source of the IP geography information, new things that I'm learning about java, and how to do the lookup in Hadoop.


The Source of IP Geolocation Information

MaxMind is a company that has a specialty in geolocation data. I have no connection to MaxMind, other than a recommendation to use their software from someone at the client where I have been doing this work. There may be other companies with similar products.

One way they make money is by offering a product called GeoIP Country, which has very, very accurate information about the country where an IP address is located (they also offer more detailed geographies, such as regions, states, and cities, but country is sufficient for my purposes). Their claim is that GeoIP Country is 99.8% accurate.

Although quite reasonably priced, I am content to settle for the free version, called GeoLite Country, for which the claim is 99.5% accuracy.

These products come in two parts. The first part is an interface, which is available for many languages, with the java version here. I assume the most recent version is the best, although I happen to be using an older version.

Both the free and paid versions use the same interface, which is highly convenient, in case I want to switch between them. The difference is the database, which is available from this download page. The paid version has more complete coverage and is updated more frequently.

The interface consists of two important components:
  • Creating a LookupService object, which is instantiated with an argument that names the database file.
  • Using LookupService.getCountry() to do the lookup.
Simple enough interface; how do we get it to work in java, and in particular, in java for Hadoop?


New Things I've Learned About Java

As I mentioned a few weeks ago in my first post on learning Hadoop, I had never used java prior to this endeavor (although I am familiar with other object-oriented programming languages such as C++ and C#). I have been learning java on an "as needed" basis, which is perhaps not the most efficient way overall but has been the fastest way to get started.

When programming java, there are two steps. I am using the javac command to compile code into class files. Then I'm using the jar command to create a jar file. I have been considering this the equivalent of "compiling and linking code", which also takes two steps.

However, the jar file is much more versatile than a regular executable image. In particular, I can put any files there. These files are then available in my application, although java calls them "resources" instead of "files". This will be very important in getting MaxMind's software to work with Hadoop. I can include the IP database in my application jar file, which is pretty cool.

There is a little complexity, though, which involves the paths where they are located. When using hadoop, I have been using statements such as "org.apache.hadoop.mapreduce" without really understanding them. This statement brings in classes associated with the mapreduce package, because three things have happened:
  • The original work (at apache) was done in a directory structure that included ./org/apache/hadoop/mapreduce.
  • The jar file was created in that (higher-level) directory. Note that this could be buried deep down in the directory hierarchy. Everything is relative to the directory where the jar file is created.
  • I am including that jar file explicitly in my javac command, using the -cp argument which specifies a class path.
All of this worked without my having to understand it, because I had some examples of working code. The MaxMind code then poses a new problem. This is the first time that I have to get someone else's code to work. How do we do this?

First, after you uncompress their java code, copy the com directory to the place where you create your java jar file. Actually, you could just link the directories. Or, if you know what you are doing, then you may have another solution.

Next, for compiling the files, I modified the javac command line, so it read: javac -cp .:/opt/hadoop/hadoop-0.20.1-core.jar:com/maxmind/geoip [subdirectory]/*.java. That is, I added the geoip directory to the class path, so java can find the class files.

The class path can accept either a jar file or a directory. When it is a jar file, javac looks for classes in the jar file. When it is a directory, it looks for classes in the directory (but not in subdirectories). That is simple enough. I do have to admit, though, that it wasn't obvious when I started. I don't think of jar files and directories as being equivalent. But they are.

Once the code compiles, just be sure to include the com/maxmind/geoip/* files in the jar command. In addition, I also copied over the GeoLite Country database and included it in the jar file. Do note that the path used to put things in the jar file makes a difference! So, "jar ~/maxmind/*.dat" behaves differently from "jar ./*.dat", when we want to use the data file.


Getting MaxMind to Work With Hadoop

Things are a little complicated in the Hadoop world, because we need to pass in a database file to initialize the MaxMind classes. My first attempt was to initialize the lookup service in the map class using code like:

iplookup = new LookupService("~/maxmind/GeoIP.dat",
.............................LookupService.GEOIP_MEMORY_CACHE |
.............................LookupService.GEOIP_CHECK_CACHE);


This looked right to me and was similar to code that I found in various places on the internet.

Guess what? It didn't work. And it didn't work for a fundamentally important reason. Map classes are run on the distributed nodes, and the distributed nodes do not have access to the local file system. Duh, this is why the HDFS (hadoop distributed file system) was invented!

But now, I have a problem. There is a reasonably sized data file -- about 1 Mbyte. Copying it to the HDFS does not really solve my problem, because it is not an "input" into the Map routine. I suppose, I could copy it and then figure out how to open it as a sequence file, but that is not the route I took.

Up to this point, I had found three ways to get information into the Map classes:
  1. Compile it in using constants.
  2. Pass small amounts on the Conf structure, using the various set and get functions. I have examples of this in the row number code (see the sketch after this list).
  3. Use the distributed cache. I haven't done this yet, because there are warnings about setting it up correctly using configuration xml files. Wow, that is something that I can easily get wrong. I'll learn this when I think it is absolutely necessary, knowing that it might take a few hours to get it right.
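
Here is a rough sketch of option 2, passing a small value through the configuration. The property name is made up for illustration; the row number code mentioned above has a real example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of passing a small value through the job configuration.
// The property name "ipcountry.default" is made up for illustration.
public class ConfExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("ipcountry.default", "unknown");   // set in the master routine
        Job job = new Job(conf, "sessionize");      // before the map function is called
        job.setJarByClass(ConfExample.class);
        // ... set input/output paths, mapper, reducer, etc. ...
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String defaultCountry;

        @Override
        protected void setup(Context context) {
            // read the value back on the distributed node
            defaultCountry = context.getConfiguration().get("ipcountry.default", "unknown");
        }
    }
}
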
But now, I've discovered that java has an amazing fourth way: I can pass files in through the jar file. Remember, when we use Hadoop, we call a function "setJarByClass()". Well, this function takes the class that is passed in and sends the entire jar file containing the class to each of the distributed nodes (for both the Map and Reduce classes). Now, if that jar file just happens to contain a data file with ip address to country lookup data, then java has conspired to send my database file exactly where it is needed!

Thank you java! You solved this problem. (Or, should I be thanking Hadoop?)

The only question is how to get the file out of the jar file. Well, the things in the jar file are called "resources". Resources are accessed using uniform resource identifiers (URI). And, the URI is conveniently built out of the file name. Life is not so convenient that the URI is the file name. But, it is close enough. The URI prepends the file name with a scheme (such as "file:").

So, to get the data file out of the jar file (which we put in using the jar command), we need to:
  • figure out the name for the resource in the jar file;
  • convert the resource name to a file name; and then,
  • open this just as we would a regular file (by passing it into the constructor).
The code to do this is:

import com.maxmind.geoip.*;
...
if (iplookup == null) {
....String filename = getClass().getResource("/GeoIP.dat").toExternalForm().substring(5);
....iplookup = new LookupService(filename, LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

The import tells the java code where to find the LookupService class. To make this work, we have to include the appropriate directory in the class path, as described earlier.

The first statement creates the file name. The resource name "/GeoIP.dat" says that the resource is a file, located in the directory where the jar file was created. The rest of the statement converts this to a file name. The function "toExternalForm()" creates a URI, which is the filename prepended with a scheme. The substring(5) removes the scheme (most likely "file:", which is five characters long). The original example code I found had substring(6), which did not work for me on EC2.

The second statement passes this into the lookup service constructor.

Now the lookup service is available, and I can use it via this code:

this.ipcountry = iplookup.getCountry(sale.ip).getCode();

Voila! Using free code downloaded from the internet, I am able to look up the country for an IP address with the distributed power of Hadoop.