Sunday, December 27, 2009

Hadoop and MapReduce: Characterizing Data

This posting describes using Hadoop and MapReduce to characterize data -- that is, to summarize the values in various columns to learn about the values in each column.

This post describes how to solve this problem using Hadoop. It also explains why Hadoop is better for this particular problem than SQL.

The code discussed in this post is available in these files:,,, and This work builds on the classes introduced in my previous post Hadoop and MapReduce: Method for Reading and Writing General Record Structures (the versions here fix some bugs in the earlier versions).

What Does This Code Do?

The purpose of this code is to provide summaries for data in a data file. Being Hadoop, the data is stored in a delimited text format, with one record per line, and the code uses GenericRecord to handle the specific data. The generic record classes are things that I wrote to handle this situation; the Apache java libraries apparently have other approaches to solving this problem.

The specific summaries for each column are:
  • Number of records.
  • Number of values.
  • Minimum and maximum values for string variables, along with the number of times the minimum and maximum values appear in the data.
  • Minimum and maximum lengths for string variables, along with the number of times these appear and an example of the value.
  • First, second, and third most common string values.
  • Number of times the column appears to be an integer.
  • Minimum and maximum values when treating the values as integers, along with the number of times that these appear.
  • Number of times the column appears to contain a real number.
  • Minimum and maximum values when treating the values as doubles, along with the number of times that these appear.
  • Count of negative, zero, and positive values.
  • Average value.
These summaries are arbitrary. The code should be readily extensible to other types and other summaries.

My ultimate intention is to use this code to easily characterize input and result files that I create in the process of writing Hadoop code.

Overview of the Code

The characterize problem is solved in two steps. The first creates a histogram of all the values in all the columns, and the second summarizes the histogram of values, which is handled by two passes of map reduce.

The histogram step takes files with the following format:
  • Key: undetermined
  • Values: text values separated by a delimited (by default a tab)
(This is the GenericRecord format.)
The Map phase produces a file of the format:
  • Key: column name and column value, separated by a colon
  • Value: "1"
Combine and Reduce then add up the "1"s, producing a file of the format:
  • Key: column name
  • Value: column value separated by tab
Using a tab as a separator is a convenience, because this is also the default separator for the key.

The second phase of the Map/Reduce job takes the previous output and uses the reduce function to summarize all the different values in the histogram. This code is quite specific to the particular summaries. The GenericRecord format is quite useful because I can simply add new summaries in the code, without worrying about the layout of the records.

The code makes use of exception processing to handle particular data types. For instance, the following code block handles the integer summaries:

try {
....long intval = Long.parseLong(valstr);
....hasinteger = true;
....intrecordcount += Long.parseLong(val.get("count"));
catch (Exception NumberFormatException) {
....// we don't have to do anything here

This block tries to convert the value to an integer (actually to a long). When this works, then the code updates the various variables that characterize integer values. When this fails, the code continues working.

There is a similar block for real numbers, and I could imagine adding more such blocks for other formats, such as dates and times.

Why MapReduce Is Better Than SQL For This Task

Characterizing data is the process of summarizing data along each column, to get an idea of what is in the data. Normally, I think about data processing in terms of SQL (after all, my most recent book is Data Analysis Using SQL and Excel). SQL, however, is particularly poor for this purpose.

First, SQL has precious few functions for this task -- basically MIN(), MAX(), AVG() and judicious use of the CASE statement. Second, SQL generally has lousy support for string functions and inconsistent definitions for date and time functions across different databases.

Worse, though, is that traditional SQL can only summarize one column at a time. The traditional SQL approach would be to summarize each column individually in a query and then connect them using UNION ALL statements. The result is that the database has to do a full-table scan for each column.

Although not supported in all databases, SQL syntax does now support the GROUPING SETS keyword which helps potentially alleviate this problem. However, GROUPING SETS is messy, since the key columns each have to be in separate columns. That is, I want the results in the format "column name, column value". With GROUPING SETS, I get "column1, column2 ... columnN", with NULLs for all unused columns, except for the one with a value.

The final problem with SQL occurs when the data starts out in text files. Much of the problem of characterizing and understanding the data happens outside the database during the load process.

No comments:

Post a Comment

Your comment will appear when it has been reviewed by the moderators.