Hi Gordon & Michael,
I have a few questions, hope you can help me!
1. While modeling, if we don’t have a very specific client requirement, at what accuracy should we usually stop? Should we stop at 75%, or 80%? Are there standard accuracy requirements based on the industry? For example, in drug research & development, model accuracy is required to be very high.
2. What is the best approach for selecting records for the training dataset when the client doesn’t have info on the cut-off/valid ranges for certain numeric columns? If it’s something like Age, there is no problem. But when it’s client/business-specific columns, it’s not that easy to figure out the valid ranges. What I usually do for such problems is: (a) do some research on the web to understand all the values that the specific column can take; (b) look at the data distribution of that column and select values based on the percentiles. E.g., if values from 10 to 60 (for that column) represent 80% of all the records, I exclude all records having values outside this range. Is this a good approach? Are there other alternatives?
3. Generally, I see model accuracy (predictive/risk/churn models) getting better when I recode/transform continuous variables into categorical variables through binning/grouping. But this also results in loss of information. How do we strike a balance here? I believe the business/domain should only decide whether I should use continuous or categorical values, and not the statistics. Is that correct?
Will check your blog regularly for the answers :)
These three questions have something in common: There is no single right answer since so much depends on the business context (in the first two cases) or the modeling context (in the third case).
My statement about no right answers is especially true of the question regarding accuracy. There are contexts where a 95% error rate is perfectly acceptable. I am thinking of response modeling for direct mail. If a model is used to choose people likely to respond to an offer and only 5% of those chosen actually respond, then the error rate is 95%. How could that be acceptable? Well, if a 4% response rate is required for profitability and the response rate for a randomly selected control group is 3% then the model--despite its apparently terrible error rate--has heroically turned a money-losing campaign into a profitable one. Success is measured in dollars (or rupees or yen, but you know what I mean) not by error rates.
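The arithmetic behind that claim can be sketched in a few lines of Python. The three response rates come from the example above; the cost per piece, the margin per response, and the mailing size are made-up figures, chosen only so that break-even lands at a 4% response rate.

```python
# Hypothetical economics, chosen so that break-even is a 4% response rate.
COST_PER_PIECE = 1.00        # assumed cost to mail one offer
MARGIN_PER_RESPONSE = 25.00  # assumed margin earned on each response


def campaign_profit(pieces_mailed, response_rate):
    """Margin earned on responders minus the total cost of the mailing."""
    responders = pieces_mailed * response_rate
    return responders * MARGIN_PER_RESPONSE - pieces_mailed * COST_PER_PIECE


# Response rate at which margin exactly covers mailing cost.
break_even_rate = COST_PER_PIECE / MARGIN_PER_RESPONSE

groups = [("random control group", 0.03), ("model-selected group", 0.05)]
for label, rate in groups:
    profit = campaign_profit(100_000, rate)  # assumed mailing size
    print(f"{label}: error rate {1 - rate:.0%}, profit {profit:+,.0f}")
```

Under these assumed figures the randomly selected mailing loses money and the model-selected mailing makes money, even though the model is "wrong" about 95% of the people it picks.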
In other contexts, much better accuracy is required. A model for credit-card fraud cannot afford a high false-positive rate because this will result in legitimate transactions not being approved. The result is unhappy card holders canceling their accounts. Even if your client cannot provide an explicit requirement for accuracy, you may be able to derive one from the business context.
Absent any other constraints, I tend to stop trying to improve a model when I reach the point of diminishing returns. When a large effort on my part yields only a minor improvement, my time will probably be better spent on some other problem.
The second question is really about when to throw out data. I see no reason to discard data just because it happens to be in the tails of the distribution. To use your example where 80% of the records have values between 10 and 60, it may be that all the best customers have a value of 75 or more. It may make sense to throw out records which contain clearly impossible values, but even in that case, I would want to understand how the impossible values were generated. If all the records with impossibly high ages were generated in the same geographic region or from the same distribution channel, throwing them out will bias your sample.
Often, unusual values have some fairly simple explanation. When looking at loyalty card data for a supermarket, we found that there were a few cards that had seemingly impossibly large numbers of orders. The explanation was that when people checked out without their card and were therefore in danger of missing out on a discount, the nice checkout lady took pity on them and used her own card to get them the discount. Understanding that mechanism meant we could safely ignore data for those cards since they did not represent the actual shopping habits of any real customer.
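One way to act on this before discarding anything is to compare the records inside and outside the proposed range. The sketch below uses a handful of invented records (the values, regions, and "top customer" flags are all hypothetical) and the 10-to-60 range from the question. It checks whether the excluded records differ in outcome and whether they cluster in one region.

```python
from collections import Counter

# Hypothetical records: (column value, sales region, is a top customer?)
records = [
    (12, "north", False), (25, "south", False), (33, "north", False),
    (41, "south", True), (48, "north", False), (55, "south", False),
    (58, "north", False), (60, "south", False),
    (75, "east", True), (82, "east", True),
]

LOW, HIGH = 10, 60  # the range that covers 80% of the records


def in_range(value):
    return LOW <= value <= HIGH


def top_customer_rate(rows):
    """Share of rows flagged as top customers (0.0 for an empty list)."""
    return sum(flag for _, _, flag in rows) / len(rows) if rows else 0.0


inside = [r for r in records if in_range(r[0])]
outside = [r for r in records if not in_range(r[0])]
excluded_regions = Counter(region for _, region, _ in outside)

print("share of records kept:", len(inside) / len(records))
print("top-customer rate inside range: ", top_customer_rate(inside))
print("top-customer rate outside range:", top_customer_rate(outside))
print("regions of excluded records:", dict(excluded_regions))
```

In this toy data the excluded tail holds the best customers and comes entirely from one region, which is exactly the situation where trimming by percentile would bias the sample.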
Whether binning continuous variables helps or harms depends very much on the particular modeling algorithm you are using and on how the binning is performed. I do not agree that, as a general rule, models are improved by binning continuous variables. As you note, this process destroys information. As an extreme example, suppose the thing you are trying to predict is completely determined by a continuous (or discrete, but with small increments) variable--a tax of a constant amount per liter, say. The more accurately you can measure the number of liters sold, the more accurately you can estimate the tax revenue. In such a case, binning could only be harmful.
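A small sketch of the tax example makes the loss concrete. The tax rate, the bin width, and the sales figures are all assumptions made for illustration; the binned estimate replaces each quantity with the midpoint of its bin.

```python
TAX_PER_LITER = 0.50  # assumed constant tax rate
BIN_WIDTH = 100       # assumed bin width, in liters


def exact_tax(liters):
    return liters * TAX_PER_LITER


def binned_tax(liters):
    """Tax estimate after replacing liters with the midpoint of its bin."""
    bin_start = (liters // BIN_WIDTH) * BIN_WIDTH
    midpoint = bin_start + BIN_WIDTH / 2
    return midpoint * TAX_PER_LITER


sales = [3, 120, 199, 250, 410]  # hypothetical liters sold per transaction
errors = [binned_tax(liters) - exact_tax(liters) for liters in sales]

for liters, error in zip(sales, errors):
    print(f"{liters:>4} liters: binned estimate is off by {error:+.2f}")
```

The binned estimate is exact only when a sale happens to sit at the midpoint of its bin; everywhere else it can be off by up to half a bin width times the tax rate.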
Binning tends to be helpful when the relationship between the explanatory variable and the thing you are trying to explain is more complex than the particular modeling technique you have chosen can handle. For example, you have chosen a linear model and the relationship is non-linear. I once modeled household penetration for my local newspaper, the Boston Globe. One of my explanatory variables was distance from Boston. Clearly, this should have some effect, but there is only a low level of linear correlation. This is because penetration goes up as a function of distance as you travel out to the first ring of suburbs where penetration is highest, but then goes down again as you continue to travel farther from Boston. So a linear model could not make good use of the untransformed variable, but it could make use of three variables in the form within_three, three_to_ten, and beyond_ten (assuming that 3 and 10 are the right bin boundaries). Of course, binning is not the only transformation that could help and linear models are not the only choice of model.
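The effect is easy to reproduce on synthetic data. The sketch below is not the newspaper data: the rise-then-fall curve, the distance range, and every number in it are invented, and only the bin names and the 3 and 10 boundaries come from the example above. It fits a straight line to raw distance, fits the three indicator variables, and compares R-squared.

```python
import math

# Synthetic stand-in: penetration rises to a peak about 6 miles out, then
# falls. The curve and all constants are invented for illustration.
distances = [0.5 * i for i in range(1, 27)]  # 0.5 to 13.0 miles
penetration = [0.2 + 0.3 * math.exp(-((d - 6) / 3) ** 2) for d in distances]


def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def bin_of(d):
    if d < 3:
        return "within_three"
    if d <= 10:
        return "three_to_ten"
    return "beyond_ten"


# Linear model on the untransformed distance.
intercept, slope = fit_line(distances, penetration)
linear_pred = [intercept + slope * d for d in distances]

# Linear model on the three indicator variables (no intercept): the
# least-squares coefficient of each indicator is the mean of its bin.
by_bin = {}
for d, p in zip(distances, penetration):
    by_bin.setdefault(bin_of(d), []).append(p)
bin_mean = {name: sum(vals) / len(vals) for name, vals in by_bin.items()}
binned_pred = [bin_mean[bin_of(d)] for d in distances]

r2_linear = r_squared(penetration, linear_pred)
r2_binned = r_squared(penetration, binned_pred)
print(f"R-squared, raw distance:    {r2_linear:.2f}")
print(f"R-squared, three indicators: {r2_binned:.2f}")
```

On this made-up curve the straight line explains very little of the variation while the three indicators explain most of it. A different transformation (a quadratic term, say) or a non-linear model would be other ways to capture the same shape.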