Saturday, May 3, 2008

Adjusting for oversampling

A few days ago, a reader of this blog used the "ask a data miner" link on the right to mail us this question. (Or, these questions.)


Question:

When modeling rare events in marketing, many have suggested taking a sample stratified by the dependent variable(s) in order to give the modeling technique a better chance of detecting a difference (or differences, in the case of k-level targets). The guidance for the proportion of the event in the sample seems to range between 15% and 50% for a binary outcome (I have seen no guidance for a k-level target). I am confused by this oversampling and have a couple of questions I am hoping you can help with:

  1. Is there a formula for adjusting the predicted probability of the event(s) when there has been oversampling on the dependent variable?
  2. Is this correction only needed when ascertaining the lift of the model or comparing it to other models trained on a dataset with the same oversampling proportion, or does the adjustment also need to be applied to the predicted values when you score a new dataset, such as when you train a model on a previous campaign and then use it to score the candidates for the upcoming campaign?
  3. I use logistic regression and decision trees for classification of categorical dependent variables. Does the adjustment from question 1 apply to both of these? Does the answer to question 2 also apply to both techniques?


Especially with decision tree models, we suggest using stratified sampling to construct a model set with approximately equal numbers of each of the outcome classes. In the most common case, there are two classes, and the one of interest is much rarer than the other. People rarely respond to direct mail; loan recipients rarely default; health care providers rarely commit fraud. The difficulty with rare classes is that decision tree algorithms work by splitting the model set into smaller and smaller groups of records that are purer and purer. When one class is very rare, the data passes the purity test before any splits are made, so the resulting model always predicts the common outcome. If only 1% of claims are fraudulent, a model that says no claims are fraudulent will be correct 99% of the time! It will also be useless. Creating a balanced model set where half the cases are fraud forces the algorithm to work harder to differentiate between the two classes.
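As a concrete illustration, here is a minimal sketch of building such a balanced model set with pandas; the DataFrame claims and its fraud column are illustrative names of our own, not anything from the question above.

    import pandas as pd

    def balanced_model_set(df, target, random_state=1):
        """Keep every record of the rare class and an equal-sized random
        sample of the common class, giving a roughly 50/50 model set."""
        rare_value = df[target].value_counts().idxmin()   # the least frequent outcome
        rare = df[df[target] == rare_value]
        common = df[df[target] != rare_value].sample(n=len(rare), random_state=random_state)
        # Concatenate and shuffle so the two classes are interleaved.
        return pd.concat([rare, common]).sample(frac=1, random_state=random_state)

    # model_set = balanced_model_set(claims, "fraud")   # claims is a hypothetical DataFrame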

To answer the first question: yes, there is a formula for adjusting the predicted probability produced on the oversampled data to get the predicted probability on data with the true distribution of classes in the population. Suppose there is 1% fraud in the population and 50% fraud in the model set. Each example of fraud in the model set represents a single actual case of fraud in the population, while each non-fraud case in the model set represents 99 non-fraud cases in the population. We say the oversampling rate is 99. So, if a certain leaf in a decision tree built on the balanced data has 95 fraudulent cases and 5 non-fraudulent cases, the actual probability of fraud predicted by that leaf is 95/(95 + 5*99), or about 0.16, because each of the 5 non-fraudulent cases stands in for 99 such cases. We discuss this at length in Chapter 7 of our book, Mastering Data Mining. You can also arrive at this result by applying the model to the original, non-oversampled data and simply counting the number of records of each class found at each leaf. This is sometimes called backfitting the model.
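In code, that adjustment might look like the sketch below (our own variable names, not the book's). It reproduces the 95/(95 + 5*99) example, and the same oversampling_rate function gives the 42.43 figure for a 1% population rate and a 30% model set that comes up in the comments.

    def oversampling_rate(pop_rate, sample_rate):
        """How many population records each common-class record in the
        oversampled model set stands for, relative to a rare-class record."""
        return (sample_rate * (1.0 - pop_rate)) / (pop_rate * (1.0 - sample_rate))

    def adjusted_probability(rare_count, common_count, osr):
        """Convert the counts at a leaf of the oversampled model set into
        the predicted probability under the true population distribution."""
        return rare_count / (rare_count + common_count * osr)

    osr = oversampling_rate(pop_rate=0.01, sample_rate=0.50)   # 99.0
    print(adjusted_probability(95, 5, osr))                    # about 0.161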

To answer the second question, this calculation is only necessary if you are actually trying to estimate the probability of the classes. If all you want to do is generate scores that can be used to rank order a list, or to compare lift for several models all built from the oversampled data, there is no need to correct for oversampling: the adjustment is monotonic, so the order of the results will not change.
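A quick check of that last claim, reusing the hypothetical oversampling rate of 99 from the sketch above: the adjusted probabilities are much smaller, but they sort the leaves in exactly the same order.

    # Leaf purities (fraction of the rare class) on the oversampled model set.
    raw_scores = [0.95, 0.60, 0.30, 0.10]

    # Same adjustment as above, written in terms of the leaf purity p:
    # p / (p + (1 - p) * OSR) is equivalent to rare / (rare + common * OSR).
    adjusted = [p / (p + (1.0 - p) * 99.0) for p in raw_scores]

    rank_raw = sorted(range(len(raw_scores)), key=raw_scores.__getitem__, reverse=True)
    rank_adj = sorted(range(len(adjusted)), key=adjusted.__getitem__, reverse=True)
    print(rank_raw == rank_adj)   # True: the adjustment never reorders the scores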

Using oversampled data also changes the results of logistic regression models, but in a more complicated way. As it happens, this is a particular interest of Professor Gary King, who taught the only actual class in statistics that I have ever taken. He has written several papers on the subject.
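For what it's worth, the correction most often cited for logistic regression (the "prior correction" for case-control sampling described, for example, in King and Zeng's work on rare events data) leaves the slope coefficients alone and only shifts the intercept. A minimal sketch, with tau standing for the population event rate and ybar for the event rate in the oversampled model set (our own names):

    import math

    def corrected_intercept(intercept, tau, ybar):
        """Prior correction for a logistic regression fit on oversampled data.
        tau  : event rate in the population (e.g. 0.01)
        ybar : event rate in the oversampled model set (e.g. 0.50)
        The slope coefficients need no adjustment."""
        return intercept - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))

    # corrected_intercept(b0, tau=0.01, ybar=0.50) subtracts log(99), about 4.6,
    # from the intercept estimated on the balanced model set.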

4 comments:

  1. In part I disagree with "If only 1% of claims are fraudulent, a model that says no claims are fraudulent will be correct 99% of the time!" This is correct if you are looking for the non-fraudulent claims. Usually with unbalanced data you are looking for the sparse class, the fraudulent claims. In that case I have had good experiences on datasets with as few as 0.5% targets. I do not care about the 99%, but about the leaves of the tree where 10% or 20% or 50% consist of rare cases. If you oversample to a 50-50 situation, you lose too much information (unless the rare class numbers in the tens of thousands). In tests with decision trees I found that augmenting the proportion of the abundant class always enhances the capability of the model to identify the rare class.

  2. With respect to the reference to Chapter 7, I think it is fair to point out a few imprecisions in that chapter. The OSR computed on p. 201 is not correct. Instead it should be 42.43 (0.99*0.3/(0.01*0.7)). This implies that the predicted density becomes 0.175 (once the obvious typo in the formula is corrected), corresponding to a lift of 17.5.

  3. Does anyone have experience with oversampling (keep all of the rare positive cases, sample the common negative cases) for an SVM? Specifically, as in the question above, how do you adjust the probabilities back again?

  4. There is some evidence that straightforward oversampling for SVMs does not buy you much (see "The Effect of Oversampling and Undersampling on Classifying ..."), unless you use some very fancy oversampling techniques that generate synthetic data points from the minority class. This also seems to jibe with the intuition behind SVMs with fairly rigid margins (large values for the cost constant C).

    Because SVMs construct decision surfaces based on the convex hulls of the classes, it does not matter how many data points are contained within the hulls (many points for the majority class, few for the minority class). Straightforward oversampling of the minority class simply replicates points within the hull of the minority class and therefore does not buy you much from an SVM point of view. (For an introduction to this geometric interpretation of SVMs, see Bennett and Campbell's paper
    Support Vector Machines: Hype or Hallelujah?)

    If you employ an SVM with soft margins (small values for C), then you allow the SVM to make a certain number of classification errors. In this case, all points of the minority class might be considered classification errors, and you could obtain a nonsensical model in the sense mentioned in the blog.

    The value of C depends highly on the kind of kernel you choose. The more complex the kernel, the more readily the data might become separable in kernel space, necessitating a lower value of C for perfect classification, and vice versa. (A short sketch of weighting the error cost per class appears after these comments.)

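On the SVM questions in the last two comments: rather than replicating minority-class points, one common alternative is to raise the misclassification cost for the rare class, which amounts to a larger effective C for that class (the cost constant discussed above). A minimal sketch using scikit-learn's SVC; the library, the 99-to-1 weighting, and the variable names are our own choices, not the commenters'.

    from sklearn.svm import SVC

    # Penalize mistakes on the rare class (label 1) 99 times as heavily as
    # mistakes on the common class, mirroring the 1%/99% example in the post.
    model = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 99.0})
    # model.fit(X_train, y_train)   # X_train, y_train: your own training data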
