Thursday, October 2, 2008

Decision Trees and Clustering


I have started writing my master's thesis and I chose a data mining topic. What I have to do is analyze the bookings of an airline company and determine for which markets, time periods, and clients the bookings can be trusted and for which they cannot. (The bookings can be canceled or modified at any time.)

I decided to use decision trees as the classification method, but I wonder whether clustering would have been more appropriate in this situation.

Thanks and best regards,

When choosing between decision trees and clustering, remember that decision trees are themselves a clustering method. The leaves of a decision tree contain clusters of records that are similar to one another and dissimilar from records in other leaves. The difference between the clusters found with a decision tree and the clusters found using other methods such as K-means, agglomerative algorithms, or self-organizing maps is that decision trees are directed while the other techniques I mentioned are undirected. Decision trees are appropriate when there is a target variable for which all records in a cluster should have a similar value. Records in a cluster will also be similar in other ways, since they are all described by the same set of rules, but the target variable drives the process. People often use undirected clustering techniques when a directed technique would be more appropriate. In your case, I think you made the correct choice because you can easily come up with a target variable, such as the percentage of cancellations, alterations, and no-shows in a market.
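To make the directed versus undirected distinction concrete, here is a minimal sketch in plain Python. It is not any particular library's algorithm. The two columns (`same_day_pct` and `changed_pct`) and all of the numbers are invented for illustration. The same search for a cut point is run twice: once scoring the groups by how tight they are on the input itself (undirected, which is what a one-dimensional two-cluster K-means aims for) and once scoring them by how tight they are on the target (directed, which is what a decision tree split does).

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, criterion):
    """Cut the records at the x value that makes the two groups as tight
    as possible on `criterion` (one criterion value per record)."""
    pairs = sorted(zip(xs, criterion))
    best_cost, best_cut = None, None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between identical x values
        cost = sse([c for _, c in pairs[:i]]) + sse([c for _, c in pairs[i:]])
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut


# Invented toy data, one value per market (hypothetical column names).
same_day_pct = [2, 4, 6, 8, 30, 32, 34, 36]    # input: % same-day bookings
changed_pct = [5, 6, 25, 27, 26, 24, 28, 25]   # target: % changed or canceled

# Undirected: group markets by similarity on the input alone.
undirected_cut = best_threshold(same_day_pct, same_day_pct)
# Directed: group markets so the target is similar within each group.
directed_cut = best_threshold(same_day_pct, changed_pct)

print("undirected cut:", undirected_cut)  # 19.0
print("directed cut:", directed_cut)      # 5.0
```

In this toy data the undirected cut falls in the obvious gap in the input, while the directed cut isolates the two markets whose bookings rarely change. Both are clusterings of the same records; only the directed one is guaranteed to say something about the target.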

You can make a model set that has one row per market. One column, the target, will be the percentage of reservations that get changed or canceled. The other columns will contain everything you know about the market: number of flights, number of connections, ratio of business to leisure travelers, number of carriers, ratio of transit passengers to origin or destination passengers, percentage of same-day bookings, same-week bookings, same-month bookings, and whatever else comes to mind. A decision tree will produce some leaves with trustworthy bookings and some with untrustworthy bookings, and the paths from the root to these leaves will be descriptions of the clusters.
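A rough sketch of the model set and tree described above, again in plain Python. Everything specific here is an assumption made for illustration: the column names (`business_pct`, `same_day_pct`, `carriers`, `changed_pct`), the eight markets and their values, and the stopping rules (`max_depth`, `min_leaf`). It is a simplified regression tree that picks splits by minimizing the squared error of the target, not the algorithm of any specific package.

```python
# Toy model set: one row per market. All values are invented.
markets = [
    {"market": "A", "business_pct": 80, "same_day_pct": 25, "carriers": 5, "changed_pct": 40},
    {"market": "B", "business_pct": 70, "same_day_pct": 30, "carriers": 4, "changed_pct": 38},
    {"market": "C", "business_pct": 75, "same_day_pct": 10, "carriers": 3, "changed_pct": 30},
    {"market": "D", "business_pct": 65, "same_day_pct": 8, "carriers": 6, "changed_pct": 28},
    {"market": "E", "business_pct": 20, "same_day_pct": 5, "carriers": 2, "changed_pct": 10},
    {"market": "F", "business_pct": 30, "same_day_pct": 6, "carriers": 3, "changed_pct": 12},
    {"market": "G", "business_pct": 25, "same_day_pct": 20, "carriers": 1, "changed_pct": 18},
    {"market": "H", "business_pct": 35, "same_day_pct": 22, "carriers": 2, "changed_pct": 20},
]
FEATURES = ["business_pct", "same_day_pct", "carriers"]
TARGET = "changed_pct"  # % of reservations changed or canceled


def target_mean(rows):
    return sum(r[TARGET] for r in rows) / len(rows)


def sse(rows):
    """Squared error of the target around its mean within a group."""
    m = target_mean(rows)
    return sum((r[TARGET] - m) ** 2 for r in rows)


def best_split(rows, min_leaf):
    """Try every feature and midpoint; keep the split with the lowest error."""
    best = None
    for f in FEATURES:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for r in rows if r[f] <= t]
            right = [r for r in rows if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    return best


def leaves(rows, path=(), depth=0, max_depth=2, min_leaf=2):
    """Yield (rule path, rows) for each leaf; the path describes the cluster."""
    split = best_split(rows, min_leaf) if depth < max_depth else None
    if split is None or split[0] >= sse(rows):
        yield path, rows
        return
    _, f, t, left, right = split
    yield from leaves(left, path + (f"{f} <= {t}",), depth + 1, max_depth, min_leaf)
    yield from leaves(right, path + (f"{f} > {t}",), depth + 1, max_depth, min_leaf)


for path, rows in leaves(markets):
    names = ", ".join(r["market"] for r in rows)
    print(" AND ".join(path), "->", names, f"(mean changed: {target_mean(rows):.1f}%)")
```

Each printed line is one leaf: the rules on the left describe the cluster, and the mean on the right says how trustworthy its bookings are. For real work you would use an established tree implementation and far more than eight rows; the point of the sketch is only the shape of the model set and of the output.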


  1. Very helpful for my final year examination! Thanks so much!

  2. Is there an R package for decision-tree-based clustering?

    1. Yes, there are several R packages for building decision trees (directed clustering). A popular one is rpart.

  3. Yes, there are, and I think 'rattle' is very helpful!

