Here’s a quick question for your blog;
- background -
I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s)
We often use neural nets to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn. Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers.
- problem -
We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, lets say 2% (it is voluntary churn of good fare paying customers) per month. So 98% of customers have a score of 0.0 and 2% have a score of 1.0.
When I build my predictive model I try to mimic this distribution. My view that is most of the churn prediction scores would be skewed toward 0.1 or 0.2, say 95% of all predicted customers, and from 0.3 to 1.0 of the churn score would apply to maybe 5% of the customer base.
Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout.
- question -
What are your views/preferences on this?
I see no reason to rescale the scores. Of course, if the only use of the scores is to mail the top 5% of the list it makes no difference since the transformation preserves the ordering, but for other applications you want the score to be an estimate of the actual probability of cancellation.
In general, scores that represent the probability of an event are more useful than scores which only order a list in descending order by probability of the event. For example, in a campaign response model, you can multiply the probability that a particular prospect will respond by the value of that response to get an expected value of making the offer. If the expected value is greater than the cost, the offer should not be made. Gordon and I discuss this and related issues in our book Mastering Data Mining.
This issue often comes up when stratified sampling is used to create a balanced model set of 50% responders and 50% non-responders. For some modeling techniques--notably, decision trees--a balanced model set will produce more and better rules. However, the proportion of responders at each leaf is no longer an estimate of the actual probability of response. The solution is simple: simply apply the model to a test set that has the correct distribution of responders to get correct estimates of the response probability.