The book is done! All 822 pages of the third edition of Data Mining Techniques for Marketing, Sales, and Customer Relationship Management will be hitting bookstore shelves later this month or you can order it now. To celebrate, I am returning to the blog.
One of the areas where Gordon and I have added a lot of new material is clustering. In this post, I want to share a nice measure of cluster goodness first described by Peter Rousseeuw in 1986. Intuitively, good clusters have the property that cluster members are close to each other and far from members of other clusters. That is what is captured by a cluster's silhouette.
To calculate a cluster’s silhouette, first calculate the average distance within the cluster. Each cluster member has its own average distance from all other members of the same cluster. This is its dissimilarity from its cluster. Cluster members with low dissimilarity are comfortably within the cluster to which they have been assigned. The average dissimilarity for a cluster is a measure of how compact it is. Note that two members of the same cluster may have different neighboring clusters. For points that are close to the boundary between
two clusters, the two dissimilarity scores may be nearly equal.
The average distance to fellow cluster members is then compared to the average distance to members of the neighboring cluster. The pictures below show this process for one point (17, 27).
The ratio of a point's dissimilarity to its own cluster to its dissimilarity with its nearest neighboring cluster is its silhouette. The typical range of the score is from zero when a record is right on the boundary of two clusters to one when it is identical to the other records in its own cluster. In theory, the silhouette score can go from negative one to one. A negative value means that the record is more similar to the records of its neighboring
cluster than to other members of its own cluster. To see how this could happen, imagine forming clusters using an agglomerative algorithm and single-linkage distance. Single-linkage says the distance from a point to a cluster is the distance to the nearest member of that cluster. Suppose the data consists of many records with the value 32 and many others with the value 64 along with a scattering of records with values from 32 to 50. In the first step, all the records at distance zero are combined into two tight clusters. In the next step, records distance one away are combined causing some 33s to be added to the left cluster followed by 34s, 35s, etc. Eventually, the left cluster will swallow records that would feel happier in the right cluster.
The silhouette score for an entire cluster is calculated as the average of the silhouette scores of its members. This measures the degree of similarity of cluster members. The silhouette of the entire dataset is the average of the silhouette scores of all the individual records. This is a measure of how appropriately the data has been
clustered. What is nice about this measure is that it can be applied at the level of the dataset to determine which clusters are not very good and at the level of a cluster to determine which members do not fit in very well. The silhouette can be used to choose an appropriate value for k in k-means by trying each value of
k in the acceptable range and choosing the one that yields the best silhouette. It can also be used to compare clusters produced by different random seeds.
The final picture shows the silhouette scores for the three clusters in the example.