When modeling rare events in marketing, many people suggest taking a sample stratified on the dependent variable(s) so that the modeling technique has a better chance of detecting a difference (or differences, in the case of k-level targets). The guidance for the proportion of the event in the sample seems to range from 15% to 50% for a binary outcome (I have seen no guidance for a k-level target). I am confused by this oversampling and have a couple of questions I am hoping you can help with:
- Is there a formula for adjusting the predicted probability of the event(s) when there has been oversampling on the dependent variable?
- Is this correction needed only when ascertaining the lift of the model or comparing it to other models trained on datasets with the same oversampling proportion, or is the adjustment also needed for the predicted values when you score a new dataset, such as when you train a model on a previous campaign and use it to score the candidates for the upcoming campaign?
- I use logistic regression and decision trees for classification of categorical dependent variables – does the adjustment from (question 1) apply to both of these? Does the answer to (question 2) also apply to both of these techniques?
Especially with decision tree models, we suggest using stratified sampling to construct a model set with approximately equal numbers of each of the outcome classes. In the most common case, there are two classes and one, the one of interest, is much rarer than the other. People rarely respond to direct mail; loan recipients rarely default; health care providers rarely commit fraud. The difficulty with rare classes is that decision tree algorithms keep splitting the model set into smaller and smaller groups of records that are purer and purer. When one class is very rare, the data can pass the purity test before any splits are made, so the resulting model always predicts the common outcome. If only 1% of claims are fraudulent, a model that says no claims are fraudulent will be correct 99% of the time! It will also be useless. Creating a balanced model set where half the cases are fraud forces the algorithm to work harder to differentiate between the two classes.
To answer the first question, yes, there is a formula for adjusting the predicted probability produced on the oversampled data to get the predicted probability on data with the true distribution of classes in the population. Suppose there is 1% fraud in the population and 50% fraud in the model set. Each example of fraud in the model set represents a single actual case of fraud in the population, while each non-fraud case in the model set represents 99 non-fraud cases in the population. We say the oversampling rate is 99. So, if a certain leaf in a decision tree built on the balanced data has 95 fraudulent cases and 5 non-fraudulent cases, the actual probability of fraud predicted by that leaf is 95/(95+5*99), or about 0.16, because each of the 5 non-fraudulent cases represents 99 such cases. We discuss this at length in Chapter 7 of our book, Mastering Data Mining. You can also arrive at this result by applying the model to the original, non-oversampled data and simply counting the number of records of each class found at each leaf. This is sometimes called backfitting the model.
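The leaf arithmetic above generalizes to any combination of population and sample proportions. Here is a minimal sketch in Python; the function name is ours, and the 1%/50% numbers are just the example from the text:

```python
def adjust_probability(p_sample, pop_rate, sample_rate):
    """Convert a probability estimated on oversampled data back to the
    population scale. Each class's count is reweighted by the ratio of
    its population proportion to its sample proportion."""
    w_event = pop_rate / sample_rate                  # e.g. 0.01 / 0.50
    w_nonevent = (1 - pop_rate) / (1 - sample_rate)   # e.g. 0.99 / 0.50
    num = p_sample * w_event
    return num / (num + (1 - p_sample) * w_nonevent)

# The leaf from the example: 95 fraud, 5 non-fraud in the balanced model set.
p_leaf = 95 / (95 + 5)                         # 0.95 on the oversampled scale
p_true = adjust_probability(p_leaf, 0.01, 0.50)
# p_true is about 0.161, the same as 95/(95 + 5*99).
```

The same reweighting can be applied to probability estimates from other classifiers, since it simply rescales the implicit counts behind each estimate.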
To answer the second question, this calculation is only necessary if you are actually trying to estimate the probability of the classes. If all you want to do is generate scores that can be used to rank order a list, or compare lift for several models all built from the oversampled data, there is no need to correct for oversampling because the rank order of the results will not change.
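One way to see why the ranking is unaffected: the adjustment is a monotonically increasing transformation, so sorting by raw scores and sorting by adjusted scores give the same order. A small illustration (the scores here are made up):

```python
# Hypothetical scores on the oversampled (50/50) scale.
probs = [0.95, 0.60, 0.30, 0.10]

def adjust(p, pop_rate=0.01, sample_rate=0.50):
    """Map an oversampled-scale probability back to the population scale."""
    w1 = pop_rate / sample_rate
    w0 = (1 - pop_rate) / (1 - sample_rate)
    return p * w1 / (p * w1 + (1 - p) * w0)

adjusted = [adjust(p) for p in probs]

# The transformation is monotonic, so both lists rank the records identically.
assert sorted(range(len(probs)), key=lambda i: probs[i]) == \
       sorted(range(len(adjusted)), key=lambda i: adjusted[i])
```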
Using oversampled data also changes the results of logistic regression models, but in a more complicated way. As it happens, this is a particular interest of Professor Gary King, who taught the only actual class in statistics that I have ever taken. He has written several papers on the subject.
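For readers who want the short version: one correction that appears in that literature, sometimes called prior correction, leaves the slope coefficients alone and shifts only the intercept by the log of the relative odds of the event in the population versus the sample. A hedged sketch using the same 1% population / 50% sample numbers; treat it as an illustration, not a substitute for King's papers:

```python
import math

def corrected_intercept(b0, pop_rate, sample_rate):
    """Shift a logit intercept fit on oversampled data back to the
    population scale; the slope coefficients are left unchanged."""
    return b0 - math.log(((1 - pop_rate) / pop_rate) *
                         (sample_rate / (1 - sample_rate)))

# A record scored at p = 0.95 on the 50/50 model set has logit ln(19).
logit_pop = corrected_intercept(math.log(0.95 / 0.05), 0.01, 0.50)
p_pop = 1 / (1 + math.exp(-logit_pop))
# p_pop is about 0.161, the same answer the decision-tree leaf gave above.
```

Because the shift is a constant added to every score, it also leaves the rank order unchanged, which is consistent with the answer to the second question.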