Comments on Data Miners Blog: Adjusting for oversampling (Michael J. A. Berry)

bluegrassbass (2008-07-21):

There is some evidence that straightforward oversampling for SVMs does not buy you much:

<A HREF="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=618F1FEB753FBAC8715131516831DDF4?doi=10.1.1.101.5878&rep=rep1&type=pdf" REL="nofollow">The Effect of Oversampling and Undersampling on Classifying ...</A>

unless you use some very fancy oversampling techniques that generate synthetic data points from the minority class. This also jibes with the intuition behind SVMs with fairly rigid margins (large values of the cost constant C).

SVMs construct decision surfaces based on the convex hulls of the classes; it does not matter how many data points are contained within the hulls (many points for the majority class, few for the minority class). Straightforward oversampling of the minority class simply replicates points within the hull of the minority class and therefore does not buy you much from an SVM point of view. (For an introduction to this geometric interpretation of SVMs, see Bennett & Campbell's paper <A HREF="http://www.sigkdd.org/explorations/issue2-2/bennett.pdf" REL="nofollow">Support Vector Machines: Hype or Hallelujah?</A>.)

If you employ an SVM with soft margins (small values of C), then you allow the SVM to make a certain number of classification errors.
In this case all points of the minority class might be considered classification errors, and you could obtain a nonsensical model in the sense mentioned in the blog.

The value of C depends highly on the kind of kernel you choose. The more complex the kernel, the more readily the data becomes separable in kernel space, necessitating a lower value of C for perfect classification, and vice versa.

Anonymous (2008-07-01):

Does anyone have experience oversampling (all positive rare cases, a sample of the common negative cases) using SVM? Specifically, as in this question: how do you adjust the probabilities back again?

Unknown (2008-06-09):

With respect to the reference to Chapter 7, I think it is fair to point out a few imprecisions in that chapter. The OSR computed on p. 201 is not correct. Instead it should be 42.43 (0.99 * 0.3 / (0.01 * 0.7)). This implies that the predicted density becomes 0.175 (correcting the obvious typo in the formula), corresponding to a lift of 17.5.

Unknown (2008-05-05):

In part I disagree with: "If only 1% of claims are fraudulent, a model that says no claims are fraudulent will be correct 99% of the time!" This is correct only if you are looking for the non-fraudulent claims. Usually with unbalanced data you are looking for the sparse class, here the fraudulent claims.
In that case I have had good experiences on datasets with as few as 0.5% targets. I do not care about the 99%, but about the leaves of the tree where 10%, 20%, or 50% consist of rare cases. If you oversample to a 50-50 situation, you lose too much information (unless the rare class numbers in the tens of thousands). In tests with decision trees I found that increasing the proportion of the abundant class always enhances the capability of the model to identify the rare class.
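The Chapter 7 correction above (OSR of 42.43, adjusted density 0.175, lift 17.5) can be checked directly, and the same arithmetic answers the earlier question about adjusting probabilities back after oversampling. A minimal sketch, assuming the standard odds-based back-adjustment p = s / (s + (1 - s) * OSR) and assuming an oversampled leaf density of 0.9, the value that reproduces the commenter's corrected figures exactly:

```python
# Back-adjusting a score from an oversampled model to the true population.
# True rare-class prior: 1%; prior after oversampling: 30%, as in the
# Chapter 7 correction above. The oversampled leaf density of 0.9 is an
# assumption chosen because it reproduces the corrected 0.175 exactly.

true_prior = 0.01      # rare-class frequency in the population
sample_prior = 0.30    # rare-class frequency after oversampling

# Oversampling rate as the commenter computes it: the factor by which
# oversampling inflated the odds of the rare class.
osr = ((1 - true_prior) * sample_prior) / (true_prior * (1 - sample_prior))
print(round(osr, 2))   # 42.43

# A leaf that looks 90% rare-class in the oversampled data maps back to:
leaf_density = 0.9
adjusted = leaf_density / (leaf_density + (1 - leaf_density) * osr)
print(round(adjusted, 3))   # 0.175

# Lift relative to the 1% base rate.
print(round(adjusted / true_prior, 1))   # 17.5
```

The same formula applies per prediction, so it also converts individual oversampled model scores, not just leaf densities, back to population probabilities.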
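The convex-hull argument in the first comment can also be demonstrated concretely: replicating minority-class points leaves that class's convex hull, and hence a hard-margin SVM's decision surface, unchanged. A small self-contained sketch using Andrew's monotone chain algorithm; the point coordinates are made up for illustration:

```python
# Naive oversampling by replication does not change the convex hull of
# the minority class, which is what a hard-margin SVM's decision surface
# depends on. The coordinates below are illustrative only.

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices, interior points excluded."""
    pts = sorted(set(points))          # deduplication already hints at the result
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

minority = [(0, 0), (2, 0), (1, 2), (1, 1)]   # (1, 1) lies inside the hull
oversampled = minority * 10                   # naive replication, 10x

print(convex_hull(minority) == convex_hull(oversampled))  # True
```

Synthetic techniques such as those generating new minority points can enlarge the hull, which is why they can help where plain replication does not.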