I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.

There are really two parts to this question. The first is a general question about using frequencies to reduce the number of records. This is a fine approach: you can list each distinct record only once, along with its frequency, which counts how many times a particular pattern of feature values (including the class assigned to the target) appears. The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded the question to someone who is -- Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.

I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples versus 2e7 positive examples), with relatively few features (e.g., fewer than 200). So far I've only tried a linear SVM (liblinear) due to the size, with middling success, and I want to under-sample at least the negative set to try kernels.

A very basic question: the bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature vectors, overlapping both the positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.

I have been checking publications on imbalanced data, but I haven't found this simple issue addressed. Is there a common technique?

Thanks for any insight. Will start on your archives.
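For concreteness, the frequency-collapse the question describes might look like the following sketch (my own illustration, not from the correspondence; the toy feature vectors and labels are invented):

```python
from collections import Counter

# Toy dataset: rows of (feature_tuple, label); labels are +1 / -1.
# In the real problem there would be ~2e8 such rows with <200 features.
rows = [
    ((0, 1), +1), ((0, 1), -1), ((0, 1), -1),
    ((1, 1), +1), ((1, 1), +1),
    ((0, 0), -1),
]

# Collapse to one record per distinct (pattern, label), with a frequency.
freq = Counter(rows)

# Per-pattern class counts: pattern -> [negative_count, positive_count].
counts = {}
for (pattern, label), n in freq.items():
    c = counts.setdefault(pattern, [0, 0])
    c[1 if label > 0 else 0] += n

print(counts)  # {(0, 1): [2, 1], (1, 1): [0, 2], (0, 0): [1, 0]}
```

Only the distinct patterns and their per-class counts need to be kept, which is where the several-orders-of-magnitude reduction comes from.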

Here is his reply:

I have some fundamental questions about the appropriateness of SVM for this classification problem.

SVM tries to find a hyperplane that separates your classes. When (as is very common with things such as marketing response data, default, fraud, or pretty much any data I ever work with) there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative you could consider is decision trees. As long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occurring pattern of features will form a leaf and, taking the frequencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.

Identical observation feature vectors produce different classification outcomes. If this is truly meaningful, then we are asking the SVM to construct a decision plane through a point with some of the examples at this point classified as positive and some as negative. This is not possible. It means one of two things: (a) we have a sampling problem, where different observations are mapped onto the same feature vectors; or (b) we have a representation problem, where the feature vector is not powerful enough to distinguish observations that should be distinguished.

It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding, and perhaps of coming up with derived features that would make this a problem suitable for decision-plane-based classification algorithms such as SVMs. (Is assigning the majority label to points that carry multiple observations an option?)
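To make the parenthetical suggestion concrete, here is a minimal sketch (my own illustration, with invented counts) of assigning each duplicated pattern the majority label, given per-pattern class counts:

```python
# Per-pattern class counts: pattern -> (negative_count, positive_count).
# These numbers are invented for illustration.
counts = {
    (0, 1): (20, 5),   # mostly negative
    (1, 1): (3, 17),   # mostly positive
    (0, 0): (8, 8),    # tie
}

def majority_label(neg, pos):
    """Return the majority class for a pattern; break ties as negative."""
    return +1 if pos > neg else -1

labels = {p: majority_label(neg, pos) for p, (neg, pos) in counts.items()}
print(labels)  # {(0, 1): -1, (1, 1): 1, (0, 0): -1}
```

The tie-breaking rule (and whether a plain majority or some other threshold is appropriate) is a design choice; the thresholded variant discussed in the comments below generalizes this.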

What software allows you to use the frequency of a given sample pattern, along with the class? Are you adding a weight variable like you would if you had a rare event class (i.e. making the weight for this "redundant" pattern large relative to other observations)?

There is a version of libsvm that allows providing weights for each sample, though it is not fully supported. I reconsidered whether I should use SVM at all for a while, but eventually returned to it. I use the weights rather heuristically: I first choose an acceptable mixture (fraction) of negative/positive classes for a cross-labeled redundant feature pattern. This may be more than 50% positive, or only more than 20%, depending on whether I am emphasizing accuracy or coverage.

E.g., to compute the counts used as weights: counts1 is the number of positive-class observations of a feature pattern, counts0 the number of negative-class observations, and min_accuracy the threshold:

ffactor = min_accuracy / (1 - min_accuracy)
nc = counts1 - ffactor * counts0
if nc >= 0:  # rebalance to the positive class
    newcount0 = 0
    newcount1 = nc
else:  # rebalance to the negative class
    newcount0 = -nc
    newcount1 = 0

Even the 20% min_accuracy can be useful, as this is roughly twice the frequency of the positive class, and this is only one step of a sequential analysis in which downstream steps can increase accuracy.
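As a sanity check on the "roughly twice the base rate" claim, using the 2e8 / 2e7 counts from the original question:

```python
neg, pos = 2e8, 2e7
base_rate = pos / (pos + neg)      # fraction of positives overall
print(round(base_rate, 3))         # 0.091 -> about 9% positive
print(round(0.20 / base_rate, 1))  # 2.2  -> a 20% threshold is ~2x the base rate
```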

It seems to be performing OK so far, but I am only starting cross-validation. Most of the work is in keeping everything fitting in 8 GB of RAM per process.

I am now seeing whether I can make or find a weighted-sample version of liblinear.
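If a natively weighted solver turns out not to be available, one common workaround (a sketch, assuming the weights are integer counts like those produced by the rebalancing step above) is to expand each collapsed pattern back into repeated rows before training:

```python
def expand_weighted(patterns):
    """Expand (feature_tuple, label, count) records into repeated rows.

    Trades memory for compatibility with solvers that accept only
    unweighted examples; a fresh list is built per row so that later
    in-place edits to one row cannot affect its duplicates.
    """
    X, y = [], []
    for features, label, count in patterns:
        X.extend(list(features) for _ in range(count))
        y.extend([label] * count)
    return X, y

# Invented example: two patterns with rebalanced integer counts.
X, y = expand_weighted([((0, 1), +1, 3), ((1, 0), -1, 2)])
print(len(X), y)  # 5 [1, 1, 1, -1, -1]
```

For very large counts this of course recreates the memory problem, so it only makes sense after the rebalancing has shrunk the counts substantially.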