Comments on Data Miners Blog: Adjusting for Oversampling
Feed author: Michael J. A. Berry (http://www.blogger.com/profile/06077102677195066016)
Last updated: 2021-09-20T02:39:30.570-04:00
6 comments

aditya joshi (https://www.blogger.com/profile/01266845158001659077), 2013-12-21T01:29:21.911-05:00:

Hi All,

I have a similar situation, where I need to find English rules/patterns from an imbalanced dataset using SAS E-Miner 7.1. The target to non-target proportion is 1%:99%.

So I took a random sample of the majority class and included all the cases of the target class (such that their proportions in the sample become 10%:90%).

I have a few questions to ask, which should help me develop a framework for this analysis.

1) Use of decision weights: Using decision weights changes the pattern or tree structure. But is it really mandatory to use them?

2) Cutoff/threshold determination: Threshold determination can play an important role, as I will get additional rules (at a lower cutoff than the default of 0.5) which differentiate the target from the non-target. I ran this node in SAS E-Miner and chose a lower cutoff. However, an interesting thing occurred. I found instances of 2 leaf nodes from the same parent node which yielded the same decision of target = Y. Does this mean that the splitting variable now becomes redundant? I chose the cutoff on the basis of the maximum average profit in the validation dataset.

3) How can I make the results more generic, rather than reporting them only on one sample? If I run multiple iterations of decision trees with the same sample proportions, it will result in different rules for each iteration. Some of the "strong" rules will be reflected in almost all iterations, some will be iteration-specific, and some may overlap across iterations (e.g., rule 1 has variables A and B and C and D; rule 2 has variables A and B and C). Is there a way to combine these results effectively in SAS E-Miner?

4) If I use the adjusted priors (same as the original priors) in the input data and take a sample with different proportions, then the decision tree uses the adjusted priors. This means the effects of using adjusted priors and of taking a sample to build the model nullify each other. I would like to know whether this is correct.

Any help on these questions would be highly appreciated.

Prashant, 2012-04-19T05:25:12.868-04:00:

In the case of an unbalanced data set, how reliable is the C value? Can the C value be treated as a measure of model fit, given that the number of ties is usually higher in models built on unbalanced data sets?

Michael J. A. Berry (https://www.blogger.com/profile/06077102677195066016), 2010-04-22T11:56:04.909-04:00:

Oversampling will change the parameter estimates for a logistic regression in a more complicated way. Some readings on this topic are listed at http://gking.harvard.edu/projects/rareevents.shtml.

Anonymous, 2010-04-21T07:14:49.598-04:00:

I guess the same logic will work in multinomial logit too. Is that correct?

Unknown (https://www.blogger.com/profile/11103699238402710920), 2009-09-21T10:43:35.664-04:00:

Gentlemen,

I have been working with over- and down-sampling for really unbalanced data, but reading Brian's question I'm wondering: is it better to use oversampling or the weights he mentions? Do you have papers that compare the two techniques?

Jeff Allard, 2009-09-18T08:21:21.732-04:00:

Good Post!

I have also used the technique of Platt successfully in a case of weighted re-balancing of the training set (basically using logistic regression to map probabilities in the oversampled space to those of the actual population).

Here is a paper for the interested:
http://lkm.fri.uni-lj.si/xaigor/eng/scipaper/NiculescuMizilCaruana05-icml.pdf

Jeff
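
For readers following the thread, the simplest form of the mapping from the oversampled space back to the original population can be sketched in a few lines. This is a minimal illustration only: it is not Platt's method, and it is not a description of what SAS Enterprise Miner does internally. It assumes that sampling changed nothing except the class proportions, and it rescales a score by the ratio of original to sample priors. The function name is hypothetical, and the 1% and 10% rates are borrowed from aditya joshi's comment purely as example values.

```python
def adjust_for_oversampling(p_sample, sample_rate, true_rate):
    """Map a score from the oversampled space back to the original population.

    p_sample    -- score produced by a model trained on the oversampled data
    sample_rate -- proportion of targets in the oversampled training data
    true_rate   -- proportion of targets in the original population
    """
    # Each class is weighted by how much it was under- or over-represented.
    target_weight = true_rate / sample_rate
    non_target_weight = (1.0 - true_rate) / (1.0 - sample_rate)

    numerator = p_sample * target_weight
    return numerator / (numerator + (1.0 - p_sample) * non_target_weight)


# Example values (assumed): 1% targets originally, 10% targets in the sample.
for score in (0.10, 0.50, 0.90):
    adjusted = adjust_for_oversampling(score, sample_rate=0.10, true_rate=0.01)
    print(f"sample-space score {score:.2f} -> adjusted {adjusted:.4f}")
```

Under these assumed rates, a sample-space score equal to the sample base rate (0.10) maps back to the original base rate (0.01), which is a quick sanity check on the correction. A calibration step such as the one Jeff describes would instead fit the mapping from data rather than derive it from the priors alone.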