Monday, June 8, 2009

Confidence in Logistic Regression Coefficients

I work on the marketing team of a telecom company, and I recently encountered an annoying problem with an upsell model. Since the monthly sale rate is less than 1% of our customer base, I used oversampling, as you mention in your book ‘Mastering Data Mining’, with data from the last 3 sales months, so that I had a ratio of about 15% buyers to 85% non-buyers (a sample size of about 20K). Using alpha = 5%, I got parameter estimates that were entirely explicable from a business perspective. However, when I then re-estimated the model on the total customer base to obtain the ‘true’ parameter estimates that I will use for my monthly scoring, two effects were suddenly insignificant at alpha = 5%.

I have never encountered this before and was wondering what to do with these effects: should I kick them out of the model or not? I decided to keep them in, since they do have some business meaning, and concluded that they must have become insignificant because they describe only a micro-segment of the entire population.
In your opinion, did I interpret this correctly? . . .
Many thanks in advance for your advice,
Wendy


Michael responds:

Hi Wendy,

This question has come up on the blog before. The short answer is that with a logistic regression model trained at one concentration of responders, it is a bit tricky to adjust the model to reflect the actual probability of response on the true population. I suggest you look at some papers by Gary King on this topic.
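
For readers who want to see what that adjustment involves, here is a minimal sketch of the intercept correction (the "prior correction" described by King and Zeng for rare-events logistic regression), written in Python with statsmodels; the variable names and the commented-out usage are illustrative assumptions, not Wendy's actual setup.

    import numpy as np
    import statsmodels.api as sm

    def prior_corrected_intercept(beta0_sample, rate_sample, rate_population):
        # Subtract the log of the ratio of sample odds to population odds so that
        # scores produced on the full customer base average out to the true rate.
        offset = np.log((rate_sample / (1 - rate_sample)) *
                        ((1 - rate_population) / rate_population))
        return beta0_sample - offset

    # Hypothetical usage: a model fit on the 15% oversample, true rate about 1%.
    # result = sm.Logit(y_sample, sm.add_constant(X_sample)).fit()
    # beta0_adjusted = prior_corrected_intercept(result.params[0], 0.15, 0.01)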


Gordon responds:

Wendy, I am not sure that Prof. King deals directly with your issue of changing confidence in the coefficient estimates. To be honest, I have never considered this issue. Since you bring it up, though, I am not surprised that it can happen.

My first comment is that the results seem usable, since they are explainable. Sometimes statistical modeling stumbles on relationships in the data that make sense, although they may not be fully statistically significant. Similarly, some relationships may be statistically significant, but have no meaning in the real world. So, use the variables!

Second, if I do a regression on a set of data, and then duplicate the data (to make it twice as big) and run it again, I'll get the same estimates as on the original data. However, the confidence in the coefficients will increase. I suspect that something similar is happening on your data.
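
To see this effect concretely, here is a small synthetic sketch in Python with statsmodels (an assumption of this write-up, not a tool mentioned above): the coefficients from the doubled data match the originals, while the reported standard errors shrink by roughly a factor of the square root of two.

    import numpy as np
    import statsmodels.api as sm

    # Simulate a modest logistic relationship, then fit once on the data and
    # once on the data stacked on top of itself.
    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(5000, 2)))
    p = 1 / (1 + np.exp(-(X @ np.array([-2.0, 0.8, 0.3]))))
    y = rng.binomial(1, p)

    fit_once = sm.Logit(y, X).fit(disp=0)
    fit_twice = sm.Logit(np.tile(y, 2), np.tile(X, (2, 1))).fit(disp=0)

    print(fit_once.params, fit_twice.params)   # essentially identical estimates
    print(fit_once.bse, fit_twice.bse)         # second set is ~1/sqrt(2) as large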

If you want to fix that particular problem, then use a tool (such as SAS Enterprise Miner, and probably proc logistic) that supports a frequency option on each row. Set the frequency to one for the more common events and to an appropriate value less than one for the rarer events. I do this as a matter of habit, because it works best for decision trees. You have pointed out that the confidence in the coefficients is also affected by the frequencies, so this is a good habit with regressions as well.
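
Outside of SAS, the same weighting idea can be sketched with the freq_weights argument of statsmodels' GLM in Python; the weight below maps a 15% responder sample back to a 1% population rate, and y_sample and X_sample are hypothetical placeholders for the training data.

    import numpy as np
    import statsmodels.api as sm

    sample_rate, population_rate = 0.15, 0.01       # oversampled rate vs. true rate
    # Weight each responder so the weighted responder share matches the population.
    w_responder = ((population_rate / (1 - population_rate)) /
                   (sample_rate / (1 - sample_rate)))   # roughly 0.057 here
    # freq = np.where(y_sample == 1, w_responder, 1.0)  # 1.0 for the common class
    # model = sm.GLM(y_sample, sm.add_constant(X_sample),
    #                family=sm.families.Binomial(), freq_weights=freq)
    # print(model.fit().summary())   # coefficients and standard errors on the
    #                                # population-weighted scale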


4 comments:

  1. This comment came in as an e-mail:


    The logistic function always behaves like that. It has to do with the estimation procedure. If a result is not likely, it cannot have any explanation.

    The trick, when using a statistical model, is to stick with the theory.

    There is another solution for those using logistic regression; it is commonly used by credit scoring teams. They use a biased sample with the two outcomes represented equally, and then apply a rescaling procedure. Please check Guidelines on Credit Risk Management – Rating Models and Validation. These guidelines were prepared by the Oesterreichische Nationalbank (OeNB) in cooperation with the Financial Market Authority (FMA). You can find this paper on the internet.


    «Rescaling default probabilities is necessary whenever the proportion of good and bad cases in the sample does not match the actual composition of the portfolio in which the rating model is meant to be used.
    (…) The scaling process is performed in such a way that the segment’s correct average default probability is attained using a sample which is representative of the segment. For example, it is possible to use all good cases from the data collected as a representative sample, as these represent the bank’s actual portfolio to be captured by the rating model. (…)

    The process of rescaling the results of logistic regression involves six steps:
    1. Calculation of the average default rate resulting from logistic regression using a sample which is representative of the non-defaulted portfolio
    2. Conversion of this average sample default rate into RDFsample
    Note: the relative default frequency (RDF) is directly proportional to the general probability of default (PD):
    RDF=PD/(1 - PD)
    PD = RDF/(1-RDF)
    3. Calculation of the average portfolio default rate and conversion into RDFportfolio
    4. Representation of each default probability resulting from logistic regression as RDFunscaled
    5. Multiplication of RDFunscaled by the scaling factor specific to the rating model

    RDFscaled = RDFunscaled * (RDFportfolio / RDFsample)
    6. Conversion of the resulting scaled RDF into a scaled default probability.
    This makes it possible to calculate a scaled default probability for each possible value resulting from logistic regression. Once these default probabilities have been assigned to grades in the rating scale, the calibration is complete.»

    (A short code sketch of these six steps appears after the comments.)

  2. Excuse me, but there's a misprint in the formula PD = RDF/(1-RDF).
    The correct version is PD = RDF/(1+RDF).

  3. Michael, could you please disclose the formula which is used for "6. Conversion of the resulting scaled RDF into a scaled default probability"?
    I wonder, is it just [PDscaled = RDFscaled/(1+RDFscaled)]? Or is it more complicated?

  4. Hello Michael,

    Once again I cannot post a comment on your blog... I've tried as Anonymous and with an OpenID.
    If you can post this comment on your blog, that would be perfect.

    Thank you

    ****************

    Hello Artem,
    1. Yes, there is a misprint in my comment: it should be PD = RDF/(1+RDF).

    2. Yes, I believe that you are right: it is just PDscaled = RDFscaled/(1+RDFscaled),
    at least that is what I use.

    Nevertheless, "if the results generated by the rating model are not already sample-dependent
    default probabilities but (for example) score values, it is first necessary to assign
    default probabilities to the rating results."

    One possible way of doing so is outlined on p. 86 here: http://www.oenb.at/en/img/rating_models_tcm16-22933.pdf.

    Good mining

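
For anyone who wants to try the six-step rescaling described in the first comment, here is a minimal Python sketch that follows those steps and uses the corrected conversion PD = RDF/(1 + RDF) from the follow-up comments; the function names and the example numbers are illustrative only, not part of the OeNB guidelines.

    import numpy as np

    def to_rdf(pd):
        """Relative default frequency: RDF = PD / (1 - PD)."""
        return pd / (1.0 - pd)

    def rescale_pd(pd_unscaled, avg_pd_sample, avg_pd_portfolio):
        """Rescale model probabilities from the oversampled estimation sample
        to the portfolio the model will actually score."""
        scaling = to_rdf(avg_pd_portfolio) / to_rdf(avg_pd_sample)   # steps 1-3
        rdf_scaled = to_rdf(np.asarray(pd_unscaled)) * scaling       # steps 4-5
        return rdf_scaled / (1.0 + rdf_scaled)                       # step 6 (corrected)

    # Illustrative numbers only: scores from a 15% responder sample, 1% portfolio rate.
    print(rescale_pd([0.05, 0.15, 0.40], avg_pd_sample=0.15, avg_pd_portfolio=0.01))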
