This posting describes data normalization, explains how I accomplished it in Hadoop/MapReduce, and some tricks in the code. I should emphasize here that the code is really "demonstration" code, meaning that I have not worked hard on being sure that it always works. My purpose is to demonstrate the idea of using Hadoop to do normalization, rather than producing 100% working code.
What is Normalization and Why Do Want To Do It?
Data normalization is the process of extracting values from a single column and placing them in a reference table. The data used by Hadoop is typically unnormalized, meaning that data used in processing is in a single record, so there is no need to join in reference tables. In fact, doing a join is not obvious using the MapReduce primitives, although my understanding is that Hive and Pig -- two higher level languages based on MapReduce -- do incorporate this functionality.
Why would we want to normalize data? (This is a good place to plug my book Data Analysis Using SQL and Excel, which explains this concept in more detail in the first chapter.) In the relational world, the reason is something called "relational integrity", meaning that any particular value is stored in one, and only one, place. For instance, if the state of California were to its name, we would not want to update every record from California. Instead, we'd rather go to the reference table and just change the name to the new name, and the data field contains a state id rather than the state name itself. Relational integrity is particularly important when data is being updated.
Why would we want to normalize data used by Hadoop? There are two reasons. The first is that we may be using Hadoop processing to load a relational database -- one that is already designed with appropriate reference tables. This is entirely reasonable, relational databases are an attractive way to "publish" results from complex data processing since they are better for creating end-user reports and building interactive GUI interfaces.
The second reason is performance. Extracting long strings and putting them in a separate reference table can significantly reduce the storage requirements for the data files. By far, most of the space taken up in typical log files, for instance, consists of long URIs (what I used to call URLs). When processing the log files, we might want to extract some features from the URIs, but keeping the entire string just occupies a lot of space -- even in a compressed file.
The Process of Normalizing Data
Normalizing data starts with data structures. The input records are assumed to be in a delimited format, with the column names in the first row (or provided separately, although I haven't tested that portion of the code yet). In addition, there is a "master" id file that contains the following columns:
- id -- a unique id for every value by column.
- column name -- the name of the column.
- value -- the id in the column.
- count -- the total number of times the value as so far occurred.
What happens when we normalize data? Basically, we look through the data file to find new values in each column being normalized. We append these new values into the master id file, and then go back to the original data and replace the values with the ids.
Hadoop is a good platform for this for several reasons. First, because the data is often stored as text files, the values and the ids have the same type -- text strings. This means that the file structures remain the same. Second, Hadoop can process multiple columns at the same time. Third, Hadoop can use inexpensive clusters and free software for this task, rather than relying on databases and tools, which are often more expensive.
How To Normalize Data Using Hadoop/MapReduce
The normalization process has six steps. Most of these correspond to a single Map-Reduce pass.
Step 1: Extract the column value pairs from the original data.
This step explodes the data, by creating a new data set with multiple rows for each row in the original data. Each output row contains a column, a value, and the number of times the value appears in the data. Only columns being normalized are included in the output.
This step also saves the column names for the data file in a temporary file. I'll return to why this is needed in Step 6.
Step 2: Extract column-value Pairs Not In Master ID File
This step compares the column-value pairs produced in the first step with those in the master id file. This step is interesting, because it reads data from two different data source formats -- the master id file and the results from Step 1. Both sets of data files use the GenericRecord format.
To identify the master file, the map function looks at the original data to see whether "/master" appears in the path. Alternative methods would be to look at the GenericRecord that is created or to use MultipleInputs (which I didn't use because of a warning on Cloudera's web site).
Step 3: Calculate the Maximum ID for Each Column in the Master File
This is a very simple Map-Reduce step that simply gets the maximum id for each column. New ids that are assigned will be assigned one more than this value.
This is an instance where I would very much like to have two different reduces following a map step. If this were possible, then I could combine this step with step 2.
Step 4: Calculate a New ID for the Unmatched Values
This is a two step process that follows the mechanism for adding row numbers discussed in one of my earlier posts, with one small modification. The final result has the maximum id value from Step 3 added onto it, so the result is a new id rather than just a row number.
Step 5: Merge the New Ids with the Existing Master IDs
This step merges in the results from Step 4 with the existing master id file. Currently, the results are placed into another directly. Eventually, they could simply override the master id file.
Because of the structure of the Hadoop file system, the merge could be as simple as copying the file with the new ids into the appropriate master id data space. However, this would result in an unbalanced master id file, which is probably not desirable for longer term processing.
Step 6: Replace the Values in the Original Data with IDs
This final step replaces the values with ids -- the actual normalization step. This is a two part process. The map phase of the first part takes both the original data and the master key file. All the column value pairs are exploded from the original data, as in Step 1, with the output consisting of:
- value: <"expect"|"nomaster">,
The master records are placed in the format:
- value: "master",
The last phase simply puts the records back together again, from their exploded form. The one trick here is that the metadata is read from a local file.
Tricks Used In This Code
The code is available in these files: Normalize.java, GenericRecordInputFormat.java, GenericRecord.java, and GenericRecordMetadata.java. This code uses several tricks along the way.
One trick that I use in Step 4, for the phase 1 map, makes the code more efficient. This phase of the computation extracts the maximum row number for each column. Instead of passing all the row numbers to a combine or reduce function, it saves them in a local hash-map data structure. I then use the cleanup() routine in the map function to output the maximum values.
Often the master code needs to pass variables to the map/reduce jobs. The best way to accomplish this is by using the "set" mechanism in the Configuration object. This allows variables to be assigned a string name. The names of all the variables that I use are stored in constants that start with PARAMETER_, defined at the beginning of the Normalize class.
In some cases, I need to pass arrays in, for instance, when passing in the list of column that are to be normalized. In this case, one variable gives the number of values ("normalize.usecolumns.numvals"). Then each value is stored in a variable such as "normalize.usecolumns.0" and "normalize.usecolumns.1" and so on.
Some of the important processing actually takes place in the master loop, where results are gathered and then passed to subsequent steps using this environment mechanism.
The idea behind the GenericRecord class is pretty powerful, with the column names at the top of the file. GenericRecords make it possible to read multiple types of input in the same map class, for instance, which is critical functionality for combining data from two different input streams.
However, the Map-Reduce framework does not really recognize these column names as being different, once generic records are placed in a sequence file. The metadata has to be passed somehow.
When the code itself generates the metadata, this is simple enough. A function is used to create the metadata, and this function is used in both the map and reduce phases.
A bigger problem arises with the original data. In particular, Step 6 of the above framework re-creates the original records, but it has lost the column names, which poses a conundrum. The solution is to save the original metadata in Step 1, which first reads the records. This metadata is then passed into Step 6.
In this code, this is handled by simply using a file. The first map partition of Step 1 writes this file (this partition is used to guarantee that the file is written exactly once). The last reduce in Step 6 then reads this file.
This mechanism works, but is not actually the preferred mechanism, because all the reduce tasks in Step 6 are competing to read the same file -- a bottleneck.
A better mechanism is for the master program to read the file and to place the contents in variables in the jar file passed to the map reduce tasks. Although I do this for other variables, I don't bother to do this for the file.