Saturday, April 12, 2008

Using validation data in Enterprise Miner

Dear Sir/Madam,

I am a lecturer at De Montfort University in the UK and teach modules on Data Mining at final-year BSc and MSc level. For both of these we use the Berry & Linoff Data Mining book. I have a couple of questions regarding SAS that I've been unable to find the answer to, and I wondered if you could point me in the direction of a source of information where I could find the answers. They are to do with partitioning data in SAS EM and how the different data sets are used. In the Help for SAS EM I see that it says the validation set is used in regression "to choose a final subset of predictors from all the subsets computed during stepwise regression" - so is the validation set not used in regression otherwise (e.g. in forward selection and backward elimination)?

Also, I'm not sure where we see evidence of the test set being used in any of the models I've developed (neural networks, decision trees, regression). I presume the lift charts are based on the actual model (resulting from the training and validation data sets), though I noticed that if I had only a training and a validation data set (i.e. no test set), the lift chart showed a worse model.

I hope you don't mind me asking these questions; my various books and the help don't seem to explain this fully, but I know it must be documented somewhere.

Best wishes, Jenny Carter

Dr. Jenny Carter
Dept. of Computing
De Montfort University
The Gateway
Leicester

Hi Jenny,

I'd like to take this opportunity to go beyond your actual question about SAS Enterprise Miner and make a general comment on the use of validation sets, both for variable selection in regression models and for guarding against overfitting in decision tree and neural network models.

Historically, statistics grew up in a world of small datasets. As a result, many statistical tools reuse the same data to fit candidate models and to evaluate and select them. In a data mining context, we assume that there is plenty of data, so there is no need to reuse the training data. The problem with using the training data to evaluate a model is that overfitting may go undetected. The best model is not the one that best describes the training data; it is the one that best generalizes to new data. That is what the validation set is for. The details of how Enterprise Miner accomplishes this vary with the type of model. In no case is the test set used either to fit the model or to select among candidate models. Its purpose is to let you see how your model will do on data that was not involved in the model building or selection process.
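
To make the three-way split concrete, here is a minimal sketch, in Python with scikit-learn, of the kind of partition that Enterprise Miner's Data Partition node performs. The file name and the 60/20/20 allocation are illustrative choices, not settings I am attributing to Enterprise Miner:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("customers.csv")  # hypothetical modeling dataset

    # Carve off 40% of the rows, then split that 40% in half,
    # giving a 60% training / 20% validation / 20% test partition.
    train, rest = train_test_split(df, test_size=0.40, random_state=42)
    valid, test = train_test_split(rest, test_size=0.50, random_state=42)

    # Training data fits candidate models, validation data selects
    # among them, and the test set is touched only once, for an
    # honest estimate of performance on unseen data.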

Regression Models

When you use any of the model selection methods (Forward, Stepwise, Backward), you also get to select a method for evaluating the candidate models formed from different combinations of explanatory variables. Most of the choices make no use of the validation data. Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (SBC) both add a penalty term for the number of effects in the model to a function of the error sum of squares. This penalty compensates for the fact that additional model complexity appears to lower the error on the training data even when the model is not actually improving. When you choose Validation Error as the selection criterion, you get the model that minimizes error on the validation set. That is our recommended setting. You must also take care to set Use Selection Default to No in the Model Selection portion of the property sheet, or Enterprise Miner will ignore the rest of your settings.
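
To illustrate the difference (this is a sketch of the general idea, not Enterprise Miner's internal code): up to constants, the training-data criteria score a candidate model with p effects as AIC = n ln(SSE/n) + 2p or SBC = n ln(SSE/n) + p ln(n), while Validation Error simply measures each candidate on the held-out partition. Here is what forward selection scored by validation error looks like in Python:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def forward_select(X_tr, y_tr, X_va, y_va, candidates):
        """Greedy forward selection, scoring each candidate model by
        its error on the validation set. X_tr and X_va are DataFrames
        and candidates is a list of column names."""
        chosen, best_err, improved = [], np.inf, True
        while improved and candidates:
            improved, best_var = False, None
            for var in candidates:
                cols = chosen + [var]
                model = LinearRegression().fit(X_tr[cols], y_tr)
                err = mean_squared_error(y_va, model.predict(X_va[cols]))
                if err < best_err:  # best extension found so far
                    best_err, best_var, improved = err, var, True
            if improved:
                chosen.append(best_var)
                candidates.remove(best_var)
        return chosen, best_err

Backward elimination and stepwise selection work the same way: each produces a sequence of candidate models, and the chosen criterion decides which one survives.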

When a training set, validation set, and test set are all present, Enterprise Miner reports statistics such as the root mean squared error for all three sets. The error on the test set, which is used neither to fit models nor to select among candidate models, is the best predictor of performance on unseen data.
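
Reporting the same statistic on all three partitions takes only a few lines. In this sketch, model stands for whichever fitted regression was selected, and the predictor matrices and targets (hypothetical names) come from splitting the three partitions above:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # model is the selected regression; X_*/y_* are the predictors
    # and target from the training, validation, and test partitions.
    for name, (X, y) in {"train": (X_tr, y_tr),
                         "valid": (X_va, y_va),
                         "test":  (X_te, y_te)}.items():
        rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
        print(f"{name}: RMSE = {rmse:.4f}")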

Decision Trees

With decision trees, the validation set is used to select a subtree of the tree grown on the training set. This process is called "pruning," and it helps prevent overfitting. Some splits that have sufficiently high worth (chi-square value) on the training data to enter the initial tree fail to improve the error rate of the tree when applied to the validation data. This is especially likely to happen when small leaf sizes are allowed. By default, if a validation set is present, Enterprise Miner will use it for subtree selection.
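
Enterprise Miner handles this automatically, but the role the validation set plays can be sketched with scikit-learn's cost-complexity pruning (a different pruning criterion than Enterprise Miner's worth-based one, but the same use of held-out data): grow a full tree on the training partition, then keep the candidate subtree that scores best on the validation partition.

    from sklearn.tree import DecisionTreeClassifier

    # Grow a deliberately large tree on the training partition and
    # enumerate its candidate subtrees.
    full = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
    path = full.cost_complexity_pruning_path(X_tr, y_tr)

    # Each alpha corresponds to one subtree; pick the one that does
    # best on the validation partition, not the training partition.
    best_alpha, best_score = 0.0, -1.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0,
                                      ccp_alpha=alpha).fit(X_tr, y_tr)
        score = tree.score(X_va, y_va)
        if score > best_score:
            best_alpha, best_score = alpha, score

    pruned = DecisionTreeClassifier(min_samples_leaf=5, random_state=0,
                                    ccp_alpha=best_alpha).fit(X_tr, y_tr)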

Neural Networks

Training a neural network is an iterative process. Each training iteration adjusts the weights associated with each network connection. As training proceeds, the network becomes better and better at "predicting" the training data. By the time training stops, the model is almost certainly overfit. Each set of weights is a candidate model. The selected model is the one that minimizes error on the validation set. In the chart shown below, after 20 iterations of training the error on the training set is still declining, but the best model was reached after only 3 training iterations.

[Chart: training-set and validation-set error by training iteration]
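
The mechanics can be sketched with a toy single-hidden-layer network in numpy; the data, architecture, and learning rate here are all made up. Every iteration yields a new weight set, and the weights kept are those with the lowest validation error, not the final ones:

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy data standing in for the training and validation partitions.
    X_tr, y_tr = rng.normal(size=(200, 5)), rng.normal(size=200)
    X_va, y_va = rng.normal(size=(100, 5)), rng.normal(size=100)

    W1 = rng.normal(scale=0.1, size=(5, 8))   # input-to-hidden weights
    W2 = rng.normal(scale=0.1, size=8)        # hidden-to-output weights
    best_err, best_weights, lr = np.inf, None, 0.01

    for it in range(20):                      # 20 training iterations
        h = np.tanh(X_tr @ W1)                # hidden-layer activations
        err = h @ W2 - y_tr                   # training residuals
        # One gradient-descent update of all connection weights.
        g2 = h.T @ err / len(y_tr)
        g1 = X_tr.T @ ((err[:, None] * W2) * (1 - h**2)) / len(y_tr)
        W1 -= lr * g1
        W2 -= lr * g2
        # Score this iteration's candidate model on the validation set.
        val_err = np.mean((np.tanh(X_va @ W1) @ W2 - y_va) ** 2)
        if val_err < best_err:
            best_err, best_weights = val_err, (W1.copy(), W2.copy())

    # best_weights now holds the selected model: the iteration with the
    # minimum validation error, even if training error kept falling.
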
1 comment:

  1. I'm a novice to data mining and SAS Enterprise Miner. I'm trying to create a lift chart using the predicted probabilities of response. I have segmented the validation data using the splitting rules derived from the training and validation data, but my Lift/Response numbers do not match the Work.LiftData numbers. I suspect I'm not ranking the predicted probabilities correctly. Could you please tell me how SAS EM builds deciles? Is there a standard formula to group the data into deciles?

    Thanks


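On the commenter's question: I can't speak here to the exact binning Enterprise Miner uses, but the standard construction of deciles for a lift chart is to sort the scored records by predicted probability, cut the ranked list into ten equal-count groups, and divide each group's response rate by the overall response rate. A pandas sketch, where p_hat and y_actual are placeholders for your predicted probabilities and actual outcomes:

    import pandas as pd

    scored = pd.DataFrame({"p_response": p_hat, "response": y_actual})
    scored = (scored.sort_values("p_response", ascending=False)
                    .reset_index(drop=True))

    # Ten equal-count bins over the ranked records. How ties in
    # p_response get broken is one common source of numbers that do
    # not match another tool's lift table.
    scored["decile"] = pd.qcut(scored.index, 10, labels=range(1, 11))

    overall = scored["response"].mean()
    lift = scored.groupby("decile", observed=True)["response"].mean() / overall
    print(lift)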