Comments on Data Miners Blog: Oversampling in General
Michael J. A. Berry (http://www.blogger.com/profile/06077102677195066016), updated 2021-09-20. 5 comments.

Michael J. A. Berry (https://www.blogger.com/profile/06077102677195066016), 2014-08-15:
I tried to track down the paper that Joey C referenced, but he did not leave enough detail.

Crystal Dong (https://www.blogger.com/profile/04120768859720688810), 2014-06-01:
The second link does not work any more. Could you specify the title and author of the article? Thank you!

Anonymous, 2012-10-25:
Popular statistical procedures, like logistic regression, can sharply underestimate the probability of rare events (http://nrs.harvard.edu/urn-3:HUL.InstRepos:4125045). Thus, oversampling helps to correct for this bias (http://www.dnbtransunion.com/insights/whitepapers/whitepaper_event.pdf).

Scott Burton (http://blog.thinkoriginally.com), 2011-04-14:
Great discussion of oversampling and some of the potential benefits it can bring. I find it interesting that you discuss whether or not to oversample strictly in terms of the algorithm chosen, which I think is likely right on, but it makes me wonder about two things.

First, is the decision to oversample a function only of the algorithm, or is it also a function of some of the features of the dataset? If so, I wonder if the features that have an impact might be more than just the number of instances per class, but maybe other characteristics of the data such as skewness or entropy, or something else that might not be immediately obvious.

Second, if the characteristics of the dataset don't have much of an impact (e.g., maybe they make a difference, but one that turns out to be small compared to the difference between the algorithms), then I wonder if it would be better for the algorithms themselves to take care of, and take into account, the oversampling. That way they could each account for how this affects their respective inductive biases.

Ed Freeman (http://tactical-logic.blogspot.com/), 2009-12-02:
Hi Gordon,

About oversampling and regression: if I'm being a bad puppy and using stepwise/forwards/backwards with a lot of variables, and I have an unbalanced data set (say around 1% positive), then in my experience running the process as-is produces much less satisfactory results than using an oversampled data set.