I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach. You can list each distinct record only once along with its frequency. The frequency counts how many times a particular pattern of feature values (including the class assigned to the target) appears. The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded your question to someone who is--Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.
I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (eg less than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.
A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.
I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?
Thanks for any insight. Will start on your archives.
Here is his reply:
I have some fundamental questions about the appropriateness of SVM for this classification problem:SVM tries to find a hyperplane that separates your classes. When, (as is very common with things such as marketing response data, or default, or fraud, or pretty much any data I ever work with), there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative you could consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occuring pattern of features will form a leaf and, taking the frquencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.Identical observation feature vectors produce different classification outcomes. If this is truly meaningful then we are asking the SVM to construct a decision plane through a point with some of the examples in this point classified as positive and some as negative. This is not possible. This means one of two things: (a) we have a sampling problem where different observations are mapped onto the same feature vectors. (b) we have a representation problem where the feature vector is not powerful enough to distinguish observations that should be distinguished.It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding and perhaps coming up with derived features that would make this a problem suitable for decision plane based classification algorithms such as SVMs. (is assigning the majority label to points that carry multiple observations an option?)