## Tuesday, September 15, 2009

If you don't mind, I would like to ask you a question regarding oversampling as you described it in your book (Mastering Data Mining...).

I can understand how you calculate predictive lift when using oversampling, but I don't know how to do it for the confusion matrix.

Would you mind telling me how to compute the confusion matrix for the actual population (not the oversampled set)?

Best,
Diego

Gentlemen-

I have severely unbalanced training data (180K negative cases, 430 positive cases). Yeah...very unbalanced.

I fit a model in a software program that allows instance weights (Weka). I give all the positive cases a weight of 1 and all the negative cases a weight of 0.0024. I fit a model such as a neural network (not a decision tree, so running the data through a test set to recalibrate is not an option). I output the probabilities and they are out of whack: good for predicting the class or ranking, but not for comparing predicted probability against actual.

What can we do to fit a model like this but then output probabilities that are in line with the distribution? Are these new (wrong) probabilities just the price we have to pay for instance weights to (1) get a model to build and (2) get reasonably good classification? Can I have my cake and eat it too (classification and probs that are close to actual)?

Many many thanks!
Brian

The problem in these cases is the same. The goal is to predict a class, usually a binary class, where one outcome is rarer than the other. To generate the best model, some method of oversampling is used so the model set has equal numbers of the two outcomes. There are two common ways of doing this. Diego is probably using all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tools run faster. Brian is using a method where weights are used for the same purpose. Rare cases are given a weight of 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.
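As a minimal sketch of the weighting scheme, the following assumes the rare class keeps a weight of 1 and the common class gets whatever weight makes the two weighted totals equal. The function name `balancing_weights` is made up for illustration; the counts are the ones from Brian's question.

```python
def balancing_weights(n_rare, n_common):
    """Return (rare_weight, common_weight) so both classes carry equal total weight.

    The rare class keeps weight 1; the common class is scaled down.
    """
    if n_rare <= 0 or n_common <= 0:
        raise ValueError("both class counts must be positive")
    return 1.0, n_rare / n_common


# Counts from Brian's question: 430 positive cases, 180K negative cases.
rare_w, common_w = balancing_weights(430, 180_000)
print(rare_w, round(common_w, 4))
```

With these counts the common-class weight rounds to 0.0024, which matches the weight Brian reports using.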

Regardless of the technique (neural network, decision trees, logistic regression, nearest neighbor, and so on), the resulting probabilities are "directionally" correct. A group of rows with a larger probability is more likely to have the modeled outcome than a group with a lower probability. This is useful for some purposes, such as getting the top 10% with the highest scores. It is not useful for other purposes, where the actual probability is needed.

Some tools can back into the desired probabilities, and do correct calculations for lift and for the confusion matrix. I think SAS Enterprise Miner, for instance, uses prior probabilities for this purpose. I say "think" because I do not actually use this feature. When I need to do this calculation, I do it manually, because not all tools support it. And, even if they do, why bother learning how? I can easily do the necessary calculations in Excel.

The key idea here is simply counting. Assume that we start with data that is 10% rare and 90% common, and we oversample so it is 50%-50%. The relationship between the original data and the model set is:
• rare outcomes: 10% --> 50%
• common outcomes: 90% --> 50%
To put it differently, each rare outcome in the original data is worth 5 in the model set. Each common outcome is worth 5/9 in the model set. We can call these numbers the oversampling rates for each of the outcomes.

We now apply these mappings to the results. Let's answer Brian's question for a particular situation. Say we have the above data and a result has a modeled probability of 80%. What is the actual probability?

Well, 80% means that there are 0.80 rare outcomes for every 0.20 common ones in the model set. Let's undo the mapping above:
• 0.80 / 5 = 0.16
• 0.20 / (5/9) = 0.36
So, the expected probability on the original data is 0.16/(0.16+0.36) = 30.8%. Notice that the probability has decreased, but it is still larger than the 10% in the original data. Also notice that the lift on the model set is 80%/50% = 1.6. The lift on the original data is 3.08 (30.8% / 10%). The expected probability goes down, and the lift goes up.
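The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the counting argument above, not a feature of any particular tool; the function names `oversampling_rates` and `adjust_probability` are invented for the example.

```python
def oversampling_rates(original_share, model_share):
    """Oversampling rate per class: share in the model set / share in the original data."""
    return {cls: model_share[cls] / original_share[cls] for cls in original_share}


def adjust_probability(p_model, rare_rate, common_rate):
    """Map a model-set probability of the rare outcome back to the original data."""
    rare = p_model / rare_rate              # undo the oversampling of rare outcomes
    common = (1.0 - p_model) / common_rate  # undo the undersampling of common outcomes
    return rare / (rare + common)


rates = oversampling_rates(
    {"rare": 0.10, "common": 0.90},  # original data
    {"rare": 0.50, "common": 0.50},  # oversampled model set
)
p = adjust_probability(0.80, rates["rare"], rates["common"])
print(f"adjusted probability: {p:.1%}")
print(f"lift on original data: {p / 0.10:.2f}")
```

A useful sanity check is that a model-set probability of 50% (no better than the model set's own average) maps back to 10%, the rate in the original data.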

This calculation can also be used for the cross-correlation matrix (or confusion matrix). In this case, you just have to divide each cell by the oversampling rate for that cell's actual outcome. So, if the confusion matrix said:
• 10 rows in the model set are rare and classified as rare
• 5 rows in the model set are rare and classified as common
• 3 rows in the model set are common and classified as rare
• 12 rows in the model set are common and classified as common
(I apologize for not including a table, but that is more trouble than it is worth in the blog.)

In the original data, this means:
• 2=10/5 rows in the original data are rare and classified as rare
• 1=5/5 rows in the original data are rare and classified as common
• 5.4 = 3/(5/9) rows in the original data are common and classified as rare
• 21.6 = 12/(5/9) rows in the original data are common and classified as common
These calculations are quite simple, and it is easy to set up a spreadsheet to do them.
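For readers who prefer code to a spreadsheet, here is a rough equivalent. It assumes the confusion matrix is stored as a nested dictionary keyed by actual outcome and then by predicted outcome; that layout and the name `rescale_confusion` are choices made for this sketch.

```python
def rescale_confusion(matrix, rates):
    """Divide each cell by the oversampling rate of its actual (row) class.

    matrix[actual][predicted] holds counts measured on the model set.
    """
    return {
        actual: {predicted: count / rates[actual] for predicted, count in row.items()}
        for actual, row in matrix.items()
    }


# Oversampling rates for 10%/90% original data oversampled to 50%/50%.
rates = {"rare": 0.50 / 0.10, "common": 0.50 / 0.90}

# Confusion matrix on the model set, as listed above.
model_set = {
    "rare": {"rare": 10, "common": 5},
    "common": {"rare": 3, "common": 12},
}

original = rescale_confusion(model_set, rates)
for actual, row in original.items():
    for predicted, value in row.items():
        print(f"actual {actual}, classified {predicted}: {value:.1f}")
```

The rescaled cells are best read as relative proportions: of the 30 rescaled rows, 3 are rare, which is the 10% rate in the original data.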

I should also mention that this method readily works for any number of classes. Having two classes is simply the most common case.
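A sketch of the multi-class version follows: divide each class's modeled probability by that class's oversampling rate, then renormalize so the results sum to 1. The three class names and their shares are hypothetical values chosen for illustration, and `adjust_distribution` is an invented name.

```python
def adjust_distribution(model_probs, rates):
    """Map model-set class probabilities back to the original data."""
    unscaled = {cls: p / rates[cls] for cls, p in model_probs.items()}
    total = sum(unscaled.values())
    return {cls: value / total for cls, value in unscaled.items()}


# Hypothetical example: three classes oversampled to equal thirds.
original_share = {"a": 0.05, "b": 0.15, "c": 0.80}
model_share = {cls: 1.0 / 3.0 for cls in original_share}
rates = {cls: model_share[cls] / original_share[cls] for cls in original_share}

# A modeled score on the balanced model set.
scored = {"a": 0.60, "b": 0.30, "c": 0.10}
adjusted = adjust_distribution(scored, rates)
print({cls: round(p, 3) for cls, p in adjusted.items()})
```

With two classes this reduces to the same calculation as the 80% example earlier in the post.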

1. Good Post!

I have also used the technique of Platt successfully in a case of weighted re-balancing of the training set (basically using logistic regression to map probabilities in the oversampled space to those of the actual).

Here is a paper for the interested:
http://lkm.fri.uni-lj.si/xaigor/eng/scipaper/NiculescuMizilCaruana05-icml.pdf

Jeff

2. Gentlemen,

I have been using oversampling and downsampling for really unbalanced data, but reading Brian's question makes me wonder: is it better to use oversampling or the weights he mentions? Do you have papers that compare those two techniques?

3. I guess the same logic will work in multinomial logit too. Is it correct?

4. Oversampling will change the parameter estimates for a logistic regression in a more complicated way. Some readings on this topic are listed at http://gking.harvard.edu/projects/rareevents.shtml.

5. In the case of an unbalanced data set, how reliable is the C value? Can the C value be treated as a measure of model fit, given that models built on unbalanced data sets mostly have a higher number of ties?

6. Hi All,

I have a similar situation, where I need to find English rules/patterns from an imbalanced dataset using SAS E-Miner 7.1. The target to non-target proportion is 1%:99%.

So I took a random sample of the majority class and included all the cases of the target class (such that their proportions in the sample become 10%:90%).

I have a few questions to ask, which will help me develop a framework for this analysis.

1) Use of decision weights: Using decision weights changes the pattern or tree structure. But is it really mandatory to use them?

2) Cut off/threshold determination: Threshold determination can play an important role, as I will get additional rules (at a lower cut off than the default of 0.5) which differentiate the target from the non-target. I ran this node in SAS E-Miner and chose a lower cut off. However, an interesting thing occurred. I could find instances of 2 leaf nodes from the same parent node which yielded the same decision of target = Y. Does this mean that the splitting variable now becomes redundant? I chose the cut off based on the maximum average profit in the validation dataset.

3) How can I make the results more generic, rather than reporting them only on one sample? If I take multiple iterations of decision trees with the same sample proportions, it will result in different rules for each iteration. Some of the "strong" rules will be reflected in almost all iterations, some will be iteration specific, and some may overlap across iterations (e.g., rule 1 has variables A and B and C and D; rule 2 has variables A and B and C). Is there a way to combine these results effectively in SAS E-Miner?

4) If I use the adjusted priors (same as original priors) in the input data and take a sample with different proportions, then the decision tree uses the adjusted priors. This means that the effects of using adjusted priors and of taking a sample to build the model nullify each other. I would like to know whether this is correct.

Any help on these questions would be highly appreciated.
