Monday, September 21, 2009

Data Mining and Statistics

We recently received the following question from a reader. (Hi Brad!)

What is the difference between Data Mining and statistics and should I care?
The way I think about it, data mining is the process of using data to figure stuff out. Statistics is a collection of tools used for understanding data. We explicitly use statistical tools all the time to answer questions such as "is the observed change in conversion rate (or response, or order size, . . .) significant or might it be just due to chance?" We also use statistics implicitly when, for example, a chi-square test inside a decision tree algorithm decides which of several candidate splits will make it into a model. When I make a histogram showing number of orders by order size bin, I am really exploring a distribution although I may not choose to describe it that way. So, data miners use statistics and much of what statisticians do might be called data mining.
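
To make the "significant or just due to chance?" question concrete, here is a minimal sketch of a chi-square test on two conversion rates, using only the Python standard library. The function name `chi_square_2x2` and the counts in the usage line are made up for illustration; they are not data from any real campaign. The sketch uses the plain Pearson statistic with no continuity correction.

```python
import math

def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    conv_a / n_a: conversions and total visitors in group A (hypothetical names)
    conv_b / n_b: conversions and total visitors in group B
    Returns (chi-square statistic, p-value).
    """
    observed = [
        [conv_a, n_a - conv_a],   # group A: converted, did not convert
        [conv_b, n_b - conv_b],   # group B: converted, did not convert
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if conversion did not depend on the group
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-square tail probability
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Illustrative counts only: 2.0% vs. 2.6% conversion on 10,000 visitors each
chi2, p = chi_square_2x2(200, 10_000, 260, 10_000)
print(f"chi-square = {chi2:.2f}, p-value = {p:.4f}")
```

A small p-value says that a difference this large would be unusual if the two groups really converted at the same rate. A decision tree that uses chi-square to pick splits is running this same kind of calculation on each candidate split.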

There is, however, a cultural difference between people who call themselves statisticians and people who call themselves data miners. This difference has its origins in different expectations about data size. Statistics grew up in an era of small data and many statisticians still live in that world. There are strong practical and budgetary limits to how many patients you can recruit for a clinical trial, for instance. Statisticians have to extract every last drop of information from their small data sets, so they have developed a lot of clever tools for doing that.

Data miners tend to live in a big data world. With big data, we can often replace cleverness with more data. Gordon's most recent post on oversampling is an example. If you have enough data that you can throw away most of the common cases and still have plenty to work with, that really is easier than keeping track of lots of weights. Similarly, with enough data, it is much easier (and more accurate) to estimate the probability that a subscriber with a tenure of 100 days will cancel by counting the many people who quit at that tenure and dividing by the even larger number of people who reached that tenure and so could have quit, than to make assumptions about the shape of the hazard function.
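
The counting approach to the cancellation probability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `empirical_hazard`, the `(tenure_days, cancelled)` record layout, and the toy records are all hypothetical, and a subscriber last observed at exactly the tenure in question without cancelling is counted as having been at risk.

```python
def empirical_hazard(subscribers, tenure):
    """Estimate the chance of cancelling at a given tenure by counting.

    subscribers: iterable of (tenure_days, cancelled) pairs, where
        tenure_days is the last tenure observed for the subscriber and
        cancelled is True if the subscription ended at that tenure
        (hypothetical layout, assumed for this sketch).
    Returns stops at `tenure` divided by subscribers who reached `tenure`,
    or None if nobody reached it.
    """
    at_risk = 0   # everyone who reached this tenure and so could have quit
    stopped = 0   # those who actually quit at exactly this tenure
    for tenure_days, cancelled in subscribers:
        if tenure_days >= tenure:
            at_risk += 1
            if cancelled and tenure_days == tenure:
                stopped += 1
    if at_risk == 0:
        return None
    return stopped / at_risk

# Toy records, made up for illustration
records = [
    (50, True),    # quit before day 100: never at risk on day 100
    (99, False),   # still active but only observed to day 99
    (100, True),
    (100, True),
    (100, False),  # active, last observed on day 100
    (150, True),
    (200, False),
]
hazard_100 = empirical_hazard(records, 100)
print(hazard_100)
```

No assumption about the shape of the hazard function is needed here, but the estimate is only trustworthy when the at-risk count is large, which is exactly the big data situation described above.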


  1. Thanks. This was a good post.

  2. Excellent post! I have to agree: data miners live in a big data world, and what's more, they can extract a small, relevant sample from that big world and draw the same conclusions from it as they would from all the data.

  3. To me there is really a clearer distinction: statistics is used to test hypotheses, whereas data mining is used to generate them.
    Suggestion: collect data on statisticians and data miners (what they do, their data, etc.) and run a logistic regression or decision tree to see which variables separate them.

  4. The statistics you describe is qualitative statistics. But there has always been quantitative statistics dealing with large amounts of data as well, apart from data mining.

  5. To me the difference lies in the assumptions of the techniques. Statistical methods make more assumptions about the data than data mining methods do. For example, linear regression assumes that the error term is normally distributed with constant variance. This assumption (among others) is needed for the standard confidence intervals to be valid.
    Standard data mining techniques, such as decision trees or association rules, make far fewer assumptions about the data. This makes data mining much more flexible, but in my opinion its results also generalize less precisely.

    However, I don't believe a clear cut exists!

  6. Yes, statisticians make assumptions, but that's why there's model validation and residual analysis. To me, a big difference between "data miners" and statisticians, especially academic statisticians, is the emphasis placed on model analysis, residual analysis, and validation of assumptions. Usually I find that data miners think they are really doing a lot to validate their models, but they really aren't. In my opinion, data miners rely too much on only a couple of small model tests (lift, confusion matrix, etc.).

