I had never encountered this and was wondering what to do with these effects: should I kick them out of the model or not? I decided to keep them in, since they did have some business meaning, and concluded that they must have become insignificant because they cover only a micro-segment of the entire population.
In your opinion, did I interpret this correctly? . . .
Many thanks in advance for your advice,
This question has come up on the blog before. The short answer is that with a logistic regression model trained at one concentration of responders, it is a bit tricky to adjust the model to reflect the actual probability of response on the true population. I suggest you look at some papers by Gary King on this topic.
Wendy, I am not sure that Prof. King deals directly with your issue of changing confidence in the coefficient estimates. To be honest, I have never considered this issue. Since you bring it up, though, I am not surprised that it may happen.
My first comment is that the results seem usable, since they are explainable. Sometimes statistical modeling stumbles on relationships in the data that make sense, although they may not be fully statistically significant. Similarly, some relationships may be statistically significant, but have no meaning in the real world. So, use the variables!
Second, if I do a regression on a set of data, and then duplicate the data (to make it twice as big) and run it again, I'll get the same estimates as on the original data. However, the confidence in the coefficients will increase. I suspect that something similar is happening on your data.
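To make the duplication effect concrete, here is a minimal sketch in Python. It assumes an ordinary least-squares regression with one predictor (chosen only because it has a simple closed form) and made-up data; the function name ols_slope_and_se is hypothetical.

```python
import math
import random

def ols_slope_and_se(xs, ys):
    """Return (slope, standard error of the slope) for y = a + b*x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

random.seed(0)  # made-up data, for illustration only
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + random.gauss(0, 2) for x in xs]

slope, se = ols_slope_and_se(xs, ys)
# Stack the same rows twice: no new information, just more rows
slope_dup, se_dup = ols_slope_and_se(xs + xs, ys + ys)

print(f"original:   slope={slope:.4f}  se={se:.4f}")
print(f"duplicated: slope={slope_dup:.4f}  se={se_dup:.4f}")
print(f"se ratio:   {se / se_dup:.4f}")
```

The slope is unchanged, while the standard error shrinks by a factor close to the square root of two, even though the duplicated rows carry no new information.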
If you want to fix that particular problem, then use a tool (such as SAS Enterprise Miner and probably proc logistic) that supports a frequency option on each row. Set the frequency to one for the less common events and to an appropriate value less than one for the more common events. I do this as a matter of habit, because it works best for decision trees. You have pointed out that the confidence in the coefficients is also affected by the frequencies, so this is a good habit with regressions as well.
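As a rough illustration of per-row frequencies, the sketch below fits a one-variable logistic regression by Newton-Raphson in plain Python. It is a stand-in for a tool's frequency option, not the SAS implementation; the data are made up, the function name fit_weighted_logistic is hypothetical, and the choice of frequency for the common class (rare count divided by common count, so the two classes carry equal total weight) is one assumption about what "an appropriate value" might be.

```python
import math
import random

def fit_weighted_logistic(xs, ys, freqs, iterations=30):
    """Newton-Raphson fit of P(y=1) = sigmoid(b0 + b1*x) with a frequency per row.

    Returns (b0, b1, se_b0, se_b1).
    """
    def score_and_information(b0, b1):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, f in zip(xs, ys, freqs):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += f * (y - p)          # weighted score, intercept
            g1 += f * (y - p) * x      # weighted score, slope
            v = f * p * (1.0 - p)
            h00 += v                   # weighted information matrix entries
            h01 += v * x
            h11 += v * x * x
        return g0, g1, h00, h01, h11

    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(b0, b1)
        det = h00 * h11 - h01 * h01
        # Newton step: inverse of the 2x2 information matrix times the score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard errors from the inverse information at the final estimates
    _, _, h00, h01, h11 = score_and_information(b0, b1)
    det = h00 * h11 - h01 * h01
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)

random.seed(1)  # made-up data with a rare response, for illustration only
xs = [random.uniform(-2, 2) for _ in range(2000)]
ys = [1 if random.random() < 1 / (1 + math.exp(2.5 - 0.8 * x)) else 0 for x in xs]

n_rare = sum(ys)
n_common = len(ys) - n_rare
# Assumption: frequency 1 for the rare class, n_rare / n_common for the common class
freqs = [1.0 if y == 1 else n_rare / n_common for y in ys]

plain = fit_weighted_logistic(xs, ys, [1.0] * len(xs))
weighted = fit_weighted_logistic(xs, ys, freqs)

print("total frequency:", sum(freqs), "of", len(xs), "rows")
print("unweighted (b0, b1, se_b0, se_b1):", plain)
print("weighted   (b0, b1, se_b0, se_b1):", weighted)
```

Because the frequencies sum to less than the number of rows, the fit is treated as if it rested on fewer observations, and the reported standard errors change accordingly. Comparing the two printed lines shows how much the confidence in each coefficient depends on the frequencies chosen.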