Tuesday, August 26, 2008

MapReduce Functionality in Commercial Databases

If you use LinkedIn, then you have probably been impressed by their "People you may know" feature. I know that I have. From old friends and colleagues to an occasional person I don't necessarily want to see again, the list often contains quite familiar names.

LinkedIn is basically a large graph of connections among people, enhanced with information such as company names, dates of links, and so on. We can imagine how they determine whether someone belongs in someone's "People you may know" list: by using common names, common companies, and even common paths (people who know each other).
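To make the idea concrete, here is a minimal sketch in Python -- certainly not LinkedIn's actual algorithm -- that scores candidate connections by counting mutual connections in a toy graph. The names and the graph are invented for illustration:

```python
# Hypothetical toy graph: each person maps to their set of direct connections.
graph = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "dee"},
    "carl": {"ann", "dee"},
    "dee": {"bob", "carl"},
}

def people_you_may_know(graph, person):
    """Rank non-connections of person by the number of mutual connections."""
    scores = {}
    for friend in graph[person]:
        for candidate in graph[friend]:
            # Skip the person themselves and people already connected.
            if candidate != person and candidate not in graph[person]:
                scores[candidate] = scores.get(candidate, 0) + 1
    # Highest mutual-connection count first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

suggestions = people_you_may_know(graph, "ann")
```

Here "dee" would be suggested to "ann" because they share two connections ("bob" and "carl"). A real system would weight in the other signals mentioned above, such as common companies.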

However, trying to imagine how they might determine this information using SQL is more challenging. SQL provides the ability to store a graph of connections. However, traversing the graph is rather complicated in standard SQL. Furthermore, much of the information that LinkedIn maintains is complicated data -- names of companies, job titles, and dates, for instance.

It is not surprising, then, that they are using MapReduce to develop this information. The surprise, though, is that their data is being stored in a relational database, which provides full transactional integrity and SQL querying capabilities. The commercial database software that supports both is provided by a company called Greenplum.

Greenplum has distinguished itself from other next-generation database vendors by incorporating MapReduce into its database engine. Basically, Greenplum developed a parallel framework for managing data, ported Postgres into this framework, and now has ported MapReduce as well. This is a strong distinction from other database vendors that provide parallel Postgres solutions, and it is particularly well suited to the complex data types encountered on the web.

I do want to point out that the integration of MapReduce is at the programming level. In other words, they have not changed SQL; they have added a programming layer that allows data in the database to be readily accessed using MapReduce primitives.
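To see what such a programming layer amounts to, here is a minimal in-memory sketch of the MapReduce primitives in Python. This is not Greenplum's API -- just an illustration of the pattern: map records to key-value pairs, group by key, then reduce each group:

```python
from functools import reduce as fold
from itertools import groupby

def mapreduce(records, mapper, reducer):
    """Minimal in-memory MapReduce: map each record to (key, value)
    pairs, group the pairs by key, then reduce each group's values."""
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=lambda kv: kv[0])  # the "shuffle/sort" phase
    return {
        key: fold(reducer, (v for _, v in group))
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    }

# Example: count people per company from rows resembling database records.
rows = [{"name": "ann", "company": "acme"},
        {"name": "bob", "company": "acme"},
        {"name": "carl", "company": "initech"}]

counts = mapreduce(rows, lambda r: [(r["company"], 1)], lambda a, b: a + b)
```

The point of the integration is that `records` could be rows streamed out of database tables, with the map and reduce steps running in parallel across the database's segments.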

As I've discussed in other posts, MapReduce and SQL are complementary technologies, each with their own strengths. MapReduce can definitely benefit from SQL functionality, since SQL has proven its ability for data storage and access. On the other hand, MapReduce has functionality that is not present in SQL databases.

Now that a database vendor has fully incorporated MapReduce into its database engine, we need to ask: Should MapReduce remain a programming paradigm or should it be incorporated into the SQL query language? What additional keywords and operators and so on are needed to enhance SQL functionality to include MapReduce?

Monday, August 11, 2008

Sorting Cells in Excel Using Formulas

This post describes how to create a new table in Excel from an existing table, where the cells are sorted, and to do this using only formulas. As a result, modifying a value in the original table results in rearranging the sorted table. And, this is accomplished without macros and without using the "sort" menu option.

The material in this post is generalized in another post. Also, if you are interested in this post, you may be interested in my book Data Analysis Using SQL and Excel.

Consider a table with two columns:

Value   Label
  15    B
  14    D
  13    C
  11    A

The sorted table would be automatically calculated as:

Value   Label
  11    A
  13    C
  14    D
  15    B

If the first value were changed to 5, then the sorted table would automatically recalculate as:

Value   Label
   5    B
  11    A
  13    C
  14    D

There are two typical approaches to sorting cells in Excel. The first is to select a region and to sort it using menu options. This does not work when the cells are protected or part of a pivot table, and sometimes not when they are calculated. It can also be a bad idea when the data is copied from another location or loaded from a database.

A common alternative is to resort to writing a macro. However, Visual Basic macros are beyond the capabilities of even many experienced Excel users.

The approach described here is much simpler, since it only uses formulas. I should mention that the method described in this post only works for numeric data that has no duplicates. I will remedy that in the next post, where the ideas are extended both to data with duplicates and to character data.

Three Excel functions are the key to the idea: RANK(), MATCH(), and OFFSET(). The first function ranks the numbers in a list. The other two use this ranking information to look up each row and pull it into sorted order.

The following shows the effect of the RANK() function:

Value   Label   Rank
  15    B       4
  14    D       3
  13    C       2
  11    A       1

The function itself looks like:


15   B   =RANK(C5, $C$5:$C$8, 1)
14   D   =RANK(C6, $C$5:$C$8, 1)
13   C   =RANK(C7, $C$5:$C$8, 1)
11   A   =RANK(C8, $C$5:$C$8, 1)

The first argument is the value to be ranked. The cell C5 contains the value "15". The second argument is the range to use for the ranking -- these are all the values in the list. And the third is the sort order; the value 1 means that the lowest values get the lowest rankings. In this case, "15" is the largest of the four values, so its rank is 4.
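For readers who think in code, the RANK() calculation can be mimicked in a few lines of Python; the values are the ones from the example table:

```python
def excel_rank(value, values, ascending=True):
    """Mimic Excel's RANK(value, range, 1): the 1-based rank of value,
    where rank 1 goes to the smallest value when ascending is True."""
    ordered = sorted(values, reverse=not ascending)
    return ordered.index(value) + 1

# The example table's values: 15, 14, 13, 11.
data = [15, 14, 13, 11]
ranks = [excel_rank(v, data, ascending=True) for v in data]
# 15 is the largest of the four values, so its ascending rank is 4.
```

Note that, like the spreadsheet method in this post, this sketch assumes the values have no duplicates.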

The following shows the formulas for the table:


1   =OFFSET(C$4, MATCH(C11, $E$5:$E$8, 0), 0)   =OFFSET(D$4, MATCH(C11, $E$5:$E$8, 0), 0)

2   =OFFSET(C$4, MATCH(C12, $E$5:$E$8, 0), 0)   =OFFSET(D$4, MATCH(C12, $E$5:$E$8, 0), 0)

3   =OFFSET(C$4, MATCH(C13, $E$5:$E$8, 0), 0)   =OFFSET(D$4, MATCH(C13, $E$5:$E$8, 0), 0)

4   =OFFSET(C$4, MATCH(C14, $E$5:$E$8, 0), 0)   =OFFSET(D$4, MATCH(C14, $E$5:$E$8, 0), 0)

The first column contains the desired rankings -- simply the numbers starting at 1 and incrementing by 1. The function MATCH(C11, $E$5:$E$8, 0) looks up each ranking in the column of calculated ranks. For instance, the value in C11 is "1", which appears in the fourth row of the rank column, so MATCH() returns 4. The OFFSET() function then retrieves the fourth value in the C column for the value and the fourth value in the D column for the label.

The result is that the sorted table is tied to the original table by formulas, so changing values in the original table will result in recalculating the sorted table, automatically.

The overall approach is simple to describe. First, we calculate the ranking of each row in the original table based on the column that we want to sort. This ranking takes on the values 1 to N for the rows in the table. Then, we create a new sorted table that has the rankings in order. This table looks up the appropriate row for each ranking using the MATCH() function. Finally, the OFFSET() function is used to look up the appropriate values from that row. The result is a table that is sorted, with a "live" connection to the original table.
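The whole RANK()/MATCH()/OFFSET() pattern can be expressed as a short Python sketch, using the example table's values. As in the spreadsheet version, it assumes numeric values with no duplicates:

```python
def rank_match_offset_sort(values, labels):
    """Sketch of the RANK()/MATCH()/OFFSET() pattern in Python.
    Assumes numeric values with no duplicates, as in the post."""
    # Step 1: RANK() -- the ascending rank of each row's value.
    ranks = [sorted(values).index(v) + 1 for v in values]
    # Step 2: MATCH() -- for each target rank 1..N, find which row holds it.
    rows = [ranks.index(r) for r in range(1, len(values) + 1)]
    # Step 3: OFFSET() -- pull the value and label from the matched row.
    return [(values[i], labels[i]) for i in rows]

sorted_table = rank_match_offset_sort([15, 14, 13, 11], ["B", "D", "C", "A"])
```

Changing a value in the input list and re-running the function corresponds to Excel recalculating the sorted table automatically.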

Tuesday, August 5, 2008

B2B Data Mining

Last week, I travelled to Malaysia to apply data mining techniques to business-to-business (B2B) problems. More typically, the work we do at Data Miners is in the world of business-to-consumer (B2C), since a company with millions of customers is more likely to need our services. However, a certain gentleman read my book Data Analysis Using SQL and Excel and wanted to apply the ideas to his business.

In this case, the business is a large computer hardware business that sells complex customized systems to customers. In fact, the systems are so complex that they have a special organization to configure the systems. The joke is that this allows the sales force to spend more time on the golf course. More seriously, it reduces configuration mistakes and helps ensure that the end customers receive the right system.

The data that we worked with was primarily sales data. The data is not tiny; several tables contain millions of rows, and the whole database probably occupies about ten gigabytes of data. However, the data structure is more complex than in a typical B2C database.

First, the important item of data is a quote. This is a configuration of hardware, software, networking equipment, and services provided to the customer. Some quotes result in orders; many do not. Each customer typically has many quotes, and several employees work on each quote. However, typically only one of those employees (let's call that person the sales rep) is credited with the sale.

A quote contains lists of items. However, these items are at the most detailed level (let's call it the UPC -- universal product code -- level). The hierarchy of products is important, particularly the two highest levels.

The highest level is the product category -- hardware, software, and so on. The next highest level is the catalog-listing level. That is, the product name in the catalog. For instance, I am writing this on an old Dell Precision M60 -- that is the "catalog" name. However, the specific disk size, amount of memory, peripherals, and software might differ from computer to computer. Does it have a 40 Gig disk or an 80 Gig disk? And so on.

What type of questions might be answered by a history of customer quotes, with associated dimension information?

One question is the "quote-to-close" question. This is a simple time-to-event problem of how long from when a quote is created to when the quote results in an order. Generally, companies in this business look at the close rate for quotes, but not the timing of the close. However, the timing provides important information, particularly when it is changing over time.
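As a sketch of the calculation, quote-to-close durations are just date differences over the quotes that resulted in orders. The quote records below are invented for illustration:

```python
from datetime import date

# Hypothetical quote records: creation date and order date (None = never closed).
quotes = [
    {"id": 1, "created": date(2008, 1, 5), "ordered": date(2008, 1, 25)},
    {"id": 2, "created": date(2008, 2, 1), "ordered": None},
    {"id": 3, "created": date(2008, 3, 1), "ordered": date(2008, 3, 11)},
]

def quote_to_close_days(quotes):
    """Days from quote creation to order, for the quotes that closed."""
    return [(q["ordered"] - q["created"]).days
            for q in quotes if q["ordered"] is not None]

durations = quote_to_close_days(quotes)
close_rate = len(durations) / len(quotes)
```

The close rate is the usual metric; the distribution of durations -- and how it shifts over time -- is the additional information discussed above.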

A related question is what affects the quote-to-close time? Complex quotes with many different components might lengthen the time. Similarly, large discounts that require more levels of approval might lengthen the time. In this case, we were surprised that certain small quotes had a very long quote-to-close time. These small quotes had large percentage discounts, and were often paired with larger orders. The large percentage discount required more levels of approval -- a sign that the company is spending too much effort in the wrong place.

Such data also suggests market basket analysis. "What products are purchased together?" is not quite the right question. Instead, the question is "What configurations are most common?"

Common configurations lead to additional questions. Which configurations (if any) are associated with longer quote-to-close times? Which configurations are associated with longer install times? Some products are readily available, some have a longer lead time, and some are built to order. The logistics database is not necessarily available for quotes. And, the historical logistics database (what was the lead time for products at a given point in the past) probably does not even exist.
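Counting common configurations is straightforward once each quote is represented as a set of items. Here is a small Python sketch with invented product names:

```python
from collections import Counter

# Hypothetical quotes, each represented as a frozenset of catalog-level items.
quotes = [
    frozenset({"server", "raid", "backup"}),
    frozenset({"server", "raid"}),
    frozenset({"server", "raid", "backup"}),
    frozenset({"laptop", "dock"}),
]

# Count how often each full configuration appears across quotes.
config_counts = Counter(quotes)
most_common = config_counts.most_common(1)
```

Joining these counts back to quote-to-close or install times would answer the questions above; traditional market basket analysis would instead count item pairs within the configurations.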

Another genre of question is related to purchase velocity. After a customer makes a purchase, when does the next purchase take place? In this business, purchases can be one of several types. A new purchase represents new equipment; an upgrade adds additional components to existing equipment; and a replacement replaces one model with another. The pattern of purchases is interesting, both at the customer level and the sales rep level. Some sales reps may be focusing on upgrade business (which is relatively easy, since the customer already has the product), and not expanding into new customers.

A related idea to purchase velocity is to categorize customers by the time since their most recent purchase. What volume of purchases are made by customers who purchased something in the previous month? In the previous twelve months? By year? Most customers make at least one purchase per year. For the dormant customers who return, what caused them to come back?
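A simple recency bucketing might look like the following Python sketch; the cutoffs (one month, one year) are assumptions chosen to match the questions above:

```python
from datetime import date

def recency_bucket(last_purchase, as_of):
    """Categorize a customer by days since their most recent purchase.
    The bucket boundaries here are illustrative assumptions."""
    days = (as_of - last_purchase).days
    if days <= 30:
        return "past month"
    if days <= 365:
        return "past year"
    return "dormant"

bucket = recency_bucket(date(2008, 6, 1), as_of=date(2008, 8, 5))
```

Summing purchase volume within each bucket then answers the "what volume comes from recent customers" question directly.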

B2B data is more complicated than B2C, because purchases are more complex and customers are organizations with multiple sites and contacts. Often, people are paid to understand the customers and to meet their needs. However, sales reps understand individual customers. There is ample opportunity for data mining to learn patterns about customers in general and to learn what works and does not work in particular circumstances.