Hi,In your experience with decision trees, do you prefer to use a small set of core variables in order to make the model more elegant and/or understandable? At what point do you feel a tree has grown too large and complicated? What are the indicators that typically tell you that you need to do some pruning?Thank you!-Adam
Elegance and ease of understanding may or may not be important depending on your model's intended purpose. There are certainly times when it is important to come up with a small set of simple rules. In our book Mastering Data Mining we give an example of a decision tree model used to produce rules that were printed on a poster next to a printing press so the press operators could avoid a particular printing defect. When a decision tree is used for customer segmentation, it is unlikely that your marketing department is equipped to handle more than a handful of segments and the segments should be described in terms of a few famiar variables. In both of these cases, the decision tree is meant to be descriptive.
On the other hand, many (I would guess most) decision trees are not intended as descriptions; they are intended to produce scores of some kind. If the point of the model is to give each prospect a probability of response, then I see no reason to be concerned about having hundreds or even thousands of leaves so long as each one receives sufficient training records that the proportion of responders at the leaf is a statistically confident estimate of the response probability. A very nice feature of decision tree models is that one need not grok the entire tree in order to interpret any particular rule it generates. Even in a very complex tree, the path from the root to a particular leaf of interest gives a fairly simple description of records contained in that leaf.
For trees used to estimate some continuous quantity, an abundance of leaves is very desirable. As estimators, regression trees have the desirable quality of never making truly unreasonable estimates (as a linear regression, for example, might do) because every estimate is an average of a large number of actual observed values. The downside is that it cannot produce any more distinct values than it has leaves. So, the more leaves the better.
The need for pruning usually arises when leaves are allowed to become too small. This leads to splits that are not statistically significant. Apply each split rule to your training set and a validation set drawn from the same population. You should see the same distribution of target classes in both training and validation data. If you do not, your model has overfit the training data. Many software tools have absurdly low default minimum leaf sizes--probably because they were developed on toy datasets such as the ubiquitous irises. I routinely set the minimum leaf size to something like 500 so overfitting is not an issue and pruning is unnecesary.
I have focused on the number of leaves rather than the number of variables since I think that is a better measure of tree complexity. You actually asked about the number of variables. I recommend a two-stage approach. In the first stage, do not worry about how many variables there are or which variable from each family of related variables gets picked by the model. One of the great uses for a decision tree is to pick a small subset of useful variables out of hundreds or thousands of candidates. At a later stage, look at the variables that were picked and think about what concept each of them is getting at. Then pick a set of variables that express those concepts neatly and perhaps even elegantly. You might find, for example, that the customer ID is a good predictor and appears in many rules because customer IDs were assigned serially and long-time customers behave differently than recent customers. Even though this makes perfect sense, it would be hard to explain so you would replace it with a more transparent indication of customer tenure such as "months since first purchase."