If you don't mind, I would like to ask you a question regarding oversampling, as you wrote about in your book (Mastering Data Mining...).
I understand how you calculate predictive lift when using oversampling, but I don't know how to do it for the confusion matrix.
Would you mind telling me how to compute the confusion matrix for the actual population (not the oversampled set)?
Thanks in advance for your reply and help.
I have severely unbalanced training data (180K negative cases, 430 positive cases). Yeah...very unbalanced.
I fit a model in a software program that allows instance weights (Weka). I give all the positive cases a weight of 1 and all the negative cases a weight of 0.0024. I fit a model such as a neural network (not a decision tree, so recalibrating by running the data through a test set is not an option). I output the probabilities, and they are out of whack: good for predicting the class or for ranking, but not for comparing predicted probability against actual.
What can we do to fit a model like this but then output probabilities that are in line with the distribution? Are these new (wrong) probabilities just the price we have to pay for instance weights to (1) get a model to build and (2) get reasonably good classification? Can I have my cake and eat it too (classification and probabilities that are close to actual)?
Many many thanks!
The problem in these cases is the same. The goal is to predict a class, usually a binary class, where one outcome is rarer than the other. To generate the best model, some method of oversampling is used so the model set has equal numbers of the two outcomes. There are two common ways of doing this. Diego is probably using all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tools run faster. Brian is using a method where weights are used for the same purpose. Rare cases are given a weight of 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.
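Brian's weighting scheme can be sketched in a few lines of Python. The counts come from his message; his 0.0024 is just the ratio of rare cases to common cases, which makes the weighted totals of the two classes equal:

```python
# Instance weights that balance the two classes, as in Brian's setup.
n_rare = 430       # positive cases, each given weight 1
n_common = 180_000 # negative cases

w_common = n_rare / n_common  # weight for each common case
print(round(w_common, 4))     # 0.0024, the weight Brian used

# The weighted sums of the two groups are now equal.
print(n_rare * 1.0, n_common * w_common)  # 430.0 430.0
```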
Regardless of the technique (neural network, decision trees, logistic regression, nearest neighbor, and so on), the resulting probabilities are "directionally" correct. A group of rows with a larger probability is more likely to have the modeled outcome than a group with a lower probability. This is useful for some purposes, such as getting the top 10% with the highest scores. It is not useful for other purposes, where the actual probability is needed.
Some tools can back into the desired probabilities and do correct calculations for lift and for the confusion matrix. I think SAS Enterprise Miner, for instance, uses prior probabilities for this purpose. I say "think" because I do not actually use this feature. When I need to do this calculation, I do it manually, because not all tools support it. And even when they do, why bother learning how? I can easily do the necessary calculations in Excel.
The key idea here is simply counting. Assume that we start with data that is 10% rare and 90% common, and we oversample so it is 50%-50%. The relationship between the original data and the model set is:
- rare outcomes: 10% --> 50%
- common outcomes: 90% --> 50%
We now apply these mappings to the results. Let's answer Brian's question for a particular situation. Say we have the above data, and a row has a modeled probability of 80%. What is the actual probability?
Well, 80% means that there are 0.80 rare outcomes for every 0.20 common ones. The rare outcomes were inflated by a factor of 5 (10% --> 50%) and the common outcomes by a factor of 5/9 (90% --> 50%). Let's undo the mapping above:
- rare: 0.80 / 5 = 0.16
- common: 0.20 / (5/9) = 0.36
These two numbers no longer sum to 1, so renormalize: the actual probability is 0.16 / (0.16 + 0.36), or about 30.8%.
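The whole adjustment fits in one small function. This is a sketch of the counting argument above, with the 10%/50% figures as defaults; the function name and parameters are my own:

```python
def actual_probability(p_model, prior_rare=0.10, sample_rare=0.50):
    """Convert a probability from an oversampled model set back to
    the probability on the original distribution."""
    prior_common = 1 - prior_rare
    sample_common = 1 - sample_rare
    # How much each class was inflated by the oversampling.
    f_rare = sample_rare / prior_rare        # 0.50 / 0.10 = 5
    f_common = sample_common / prior_common  # 0.50 / 0.90 = 5/9
    # Undo the inflation, then renormalize so the parts sum to 1.
    rare = p_model / f_rare                  # 0.80 / 5     = 0.16
    common = (1 - p_model) / f_common        # 0.20 / (5/9) = 0.36
    return rare / (rare + common)

print(round(actual_probability(0.80), 4))  # 0.3077
```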
This calculation can also be used for the confusion matrix (also called the classification matrix). In this case, you just divide each cell by the appropriate oversampling factor. So, if the confusion matrix said:
- 10 rows in the model set are rare and classified as rare
- 5 rows in the model set are rare and classified as common
- 3 rows in the model set are common and classified as rare
- 12 rows in the model set are common and classified as common
In the original data, this means:
- 10 / 5 = 2 rows in the original data are rare and classified as rare
- 5 / 5 = 1 row in the original data is rare and classified as common
- 3 / (5/9) = 5.4 rows in the original data are common and classified as rare
- 12 / (5/9) = 21.6 rows in the original data are common and classified as common
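The cell-by-cell rescaling above can be written out directly. This sketch uses the same counts and the same two factors (5 for rare, 5/9 for common); the dictionary layout is just one convenient way to hold the matrix:

```python
# Undo oversampling in a confusion matrix: divide each cell by the
# inflation factor of its *actual* class.
f_rare, f_common = 5, 5 / 9  # 10% -> 50% and 90% -> 50%

model_set = {
    ("rare", "rare"): 10,      # (actual, classified-as): count
    ("rare", "common"): 5,
    ("common", "rare"): 3,
    ("common", "common"): 12,
}

original = {
    (actual, pred): count / (f_rare if actual == "rare" else f_common)
    for (actual, pred), count in model_set.items()
}

for cell, count in original.items():
    print(cell, round(count, 1))  # 2.0, 1.0, 5.4, 21.6
```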
I should also mention that this method readily works for any number of classes. Having two classes is simply the most common case.