Sunday, May 10, 2009

Not Enough Data

An article in yesterday's New York Times reminded me of examples of "bad" examples of data mining. By bad examples, I mean that spurious correlations are given credence -- enough credence to make it into a well-reputed national newspaper.

The article, entitled "Eat Quickly, for the Economy's State" is about a leisure time report from the OECD that shows a correlation between the following two variables:
  • Change in real GNP in 2008; and,
  • Amount of time people spend eating and drinking in a given day.
The study is based on surveys from 17 countries (for more information on the survey, you can check this out).

The highlight is a few charts that shows that countries such as Mexico, Canada, and the United States have the lowest time spent eating (under 75 minutes per day) versus countries such as New Zealand, France, and Japan (over 110 minutes per day). The first group of countries have higher growth rates, both in 2008 and for the past few years.

My first problem with the analysis is one of granularity. Leisure time is measured per person, but GNP is measured over everyone. One big component of GNP growth is population growth, and different countries have very different patterns of population growth. The correct measure would be per capital GNP. Taking this into account would dampen the GNP growth figures for growing countries such as Mexico and the United States, and increase the GNP growth figures for lesser growing (or shrinking countries) such as Italy, Germany, and Japan.

Also, the countries where people eat more leisurely have other characteristics in common. In particular, they tend to have older populations and lower (or even negative) rates of population growth. One wonders if speed eating is a characteristic of younger people and leisurely eating is a characteristic of older people.

The biggest problem, though, is that this is, in all likelihood, a spurious correlation. One of the original definitions of data mining, which may still be used in the ecoonomics and political world, is a negative one: data mining is looking for data to support a conclusion. The OECD surveys were done in 17 different countries. The specific result in the NYT article is "Counties in which people eat and drink less than 100 minutes per day grow 0.9% faster -- on average -- than countries in which people each and drink more than 100 minutes per day".

In other words, the 17 countries were divided into two groups, and the growth rates were then measured for each group. Let's look at this in more detail.

How many ways are there to divide 17 countries into 2 groups? The answer is 2^17 = 131,072 different ways (any particular country could be in either group). So, if we had 131,072 yes-or-no survey questions, then would would expect any combination to arise, including the combinations where all the high growth countries are in one group and all the low growth countries in the other. (I admit the exact figure is a bit more than 131,072 but that is unimportant to illustrate my point.)

The situation actually gets worse. The results are not yes-or-no; they are numeric measurements which are then used to split the countries into two groups. The splits could be at any value of the measure. So, any given measurement results in 17-1=16 different possible splits (the first group having the country with the lowest measurement, with the two lowest, and so on). Now we only need about 8,192 uncorrelated measurements to get all possibilities.

However, we do not need all possibilities. A glance at the NYT article shows that the country with the worst 2008 growth is Poland, yet it is in the fast-eating group. And Spain -- in the slow eating group -- is the third fastest growing economy (okay, its GNP actually shrank but less than most others). So, we only need an approximation of a split, where the two groups look different. And then, voila! we get a news article.

The problem is that the OECD was able to measure dozens or hundreds of different things in their survey. My guess is that measures such as "weekly hours of work in main job," "time spent retired," and "time spent sleeping" -- just a few of the many possibilities -- did not result in interesting splits. Eventually, though, a measure such as "time spent eating and drinking" results in a split where the different groups look "statistically significant" but they probably are not. If the measure is interesting enough, then it can become an article in the New York Times.

This is probably a problem with statistical significance. The challenge is that a p-value of 0.01 means that something has only a 1% chance of happening at random. However, if we look at 100 different measures, then there is a really, really good chance that one of them will have a p-value of 0.01 or less. By the way, there is a statistical adjustment called the Bonferroni correction to take this into account (this as well as others are described in the Wikipeida).

Fortunately, neither the OECD nor the New York Times talk about this discovery as an example of data mining. It is just poor data analysis, but poor data analysis that can re-enforce lessons in good data analysis. Lately, I have been noticing more examples of articles such as this, where researchers -- or perhaps just journalists -- extrapolate from very small samples to make unsupported conclusions. These are particularly grating when they appear in respected newspapers, magazines, and journals.

Data mining is not about finding spurious correlations and claiming some great discovery. It is about extracting valuable information from large quantities of data, information that is stable and useful. Smaller amounts of data often contain many correlations. Often, these correlations are going to be spurious. And without further testing, or at least a mechanism to explain the correlation, the results should not be mentioned at all.


  1. Nice post, Gordon. I couldn't agree more with your observations in general. However, in this specific example, I am struck by an inconsistency that I can't quite explain.

    if the response to each question in the survey is considered as a predictor, and we run a decision tree algorithm with GNP growth rate as the target variable (i.e. the LHS), wouldn't the algorithm decide on the first branch of the tree doing exactly what the reporter must have done?

    That is, wouldn't each split of every variable be tested to see which variable+split leads to the best separation in growth rates? In some sense, the reporter stopped with the first branch, but the decision tree algorithm would have continued till some stopping criterion was reached.

    If the reporter was merely doing what decision-trees do, why is one "bad" and the other "good"?

  2. The issue is that there are only 17 rows being investigated, one for each country. This is too small a number for building a decision tree.

    The 17 rows are summarized from more voluminous data (either the surveys or country-wide statistics). This does not change the number of rows.

    One could imagine running a decision tree on the original survey data. First, that is not what was done to achieve the result. Second, an international survey introduces all sorts of problems, because of language, culture, sampling, and interpretations of questions. It would be challenging to derive solid conclusions from such a set of surveys.

  3. 17 rows! I didn't realize that it was so small. That explains it.


Your comment will appear when it has been reviewed by the moderators.