Thursday, March 3, 2011

Cluster Silhouettes

The book is done! All 822 pages of the third edition of Data Mining Techniques for Marketing, Sales, and Customer Relationship Management will be hitting bookstore shelves later this month, or you can order it now. To celebrate, I am returning to the blog.

One of the areas where Gordon and I have added a lot of new material is clustering. In this post, I want to share a nice measure of cluster goodness first described by Peter Rousseeuw in 1987. Intuitively, good clusters have the property that cluster members are close to each other and far from members of other clusters. That is what is captured by a cluster's silhouette.

To calculate a cluster's silhouette, first calculate the average distance within the cluster. Each cluster member has its own average distance from all the other members of the same cluster. This is its dissimilarity from its cluster. Cluster members with low dissimilarity are comfortably within the cluster to which they have been assigned, and the average dissimilarity for a cluster is a measure of how compact it is. Each member also has a dissimilarity to every other cluster: its average distance to that cluster's members. The other cluster for which this value is smallest is the member's neighboring cluster. Note that two members of the same cluster may have different neighboring clusters, and for points that are close to the boundary between two clusters, the two dissimilarity scores may be nearly equal.

The average distance to fellow cluster members is then compared to the average distance to members of the neighboring cluster. The pictures below show this process for one point (17, 27).

A point's silhouette compares its dissimilarity to its own cluster (call it a) with its dissimilarity to its nearest neighboring cluster (call it b): the silhouette is the difference b - a divided by the larger of the two. The typical range of the score is from zero, when a record is right on the boundary of two clusters, to one, when it is identical to the other records in its own cluster. In theory, the silhouette score can go from negative one to one. A negative value means that the record is more similar to the records of its neighboring cluster than to the other members of its own cluster. To see how this could happen, imagine forming clusters using an agglomerative algorithm and single-linkage distance. Single-linkage says the distance from a point to a cluster is the distance to the nearest member of that cluster. Suppose the data consists of many records with the value 32 and many others with the value 64, along with a scattering of records with values between 32 and 50. In the first step, all the records at distance zero are combined into two tight clusters. In the next step, records distance one away are combined, causing some 33s to be added to the left cluster, followed by 34s, 35s, and so on. Eventually, the left cluster swallows records that would feel happier in the right cluster.
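
For readers who want to experiment, here is a minimal sketch of the per-point calculation in Python, assuming the data is a numpy array of points with a parallel array of cluster labels; the function name and the use of Euclidean distance are just illustrative choices, and any distance measure could be substituted.

import numpy as np

def silhouette_scores(points, labels):
    # Per-point silhouette: (b - a) / max(a, b), where a is the average
    # distance to the other members of the point's own cluster and b is
    # the average distance to the members of its neighboring cluster.
    # Assumes there are at least two clusters.
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i, own in enumerate(labels):
        dists = np.linalg.norm(points - points[i], axis=1)
        same = (labels == own)
        same[i] = False
        if not same.any():
            continue  # singleton cluster: leave its silhouette at zero
        a = dists[same].mean()
        b = min(dists[labels == c].mean() for c in clusters if c != own)
        scores[i] = (b - a) / max(a, b)
    return scores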

The silhouette score for an entire cluster is calculated as the average of the silhouette scores of its members. This measures the degree of similarity of cluster members. The silhouette of the entire dataset is the average of the silhouette scores of all the individual records. This is a measure of how appropriately the data has been
clustered. What is nice about this measure is that it can be applied at the level of the dataset to determine which clusters are not very good and at the level of a cluster to determine which members do not fit in very well. The silhouette can be used to choose an appropriate value for k in k-means by trying each value of
k in the acceptable range and choosing the one that yields the best silhouette. It can also be used to compare clusters produced by different random seeds.
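
As a sketch of that procedure, here is one way to do it with scikit-learn, assuming it is available; the helper name and the range of 2 through 7 are purely illustrative choices.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 8), seed=0):
    # Fit k-means for each candidate k and keep the value with the
    # highest average silhouette over the whole dataset.
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

Running the same loop with several different seeds is one way to make the comparison of random starts mentioned above.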

The final picture shows the silhouette scores for the three clusters in the example.


11 comments:

  1. Hello,

    when you say:

    "The silhouette can be used to choose an appropriate value for k in k-means by trying each value of
    k in the acceptable range and choosing the one that yields the best silhouette."

    What do you mean by "the acceptable range"?

    I'm trying to find the natural k for a data set of 60 objects. I calculate the dataset's silhouette for all possible values of k (1 to 60). All I see is the silhouette values tending to 1 as k tends to 60, which I think makes sense... I was expecting some peak values to choose a natural k from, but I found many peaks, and all of them tend to grow as k approaches 60.

    What do you think about that?

    Thank you for your time :)

  2. In general, there is no reason to assume there is a natural value for k. Often the value for k is dictated by needs of the application. What I meant by "acceptable range" is that if you are creating customer segments and you are willing to support as few as 3 or as many as 6, then you could choose based on which value for k yields clusters with around the same number of members and a good silhouette value. I always assume that the number of clusters is much smaller than the number of data points.

  3. Thank you very much for your help :)

  4. Without having looked at your new book, I do find the concept of cluster metrics an interesting one. However, I don't really see much diversity in the way of cluster metrics. Most fields seem to fall into some sort of local minimum where they consistently reuse the same (and sometimes flawed) metric. This comment might just be flaunting my ignorance, but do you see that problem as well, or is there more diversity in the world of clustering metrics than I have seen?

    I would also ask, how does this metric compare to others that are used? Does it compare favorably? How often would you guess that it is used relative to other clustering metrics?

    Thanks for your time.

  5. @ Henrique Alves:

    When you have as many centroids as data points, what you have done is essentially create a lookup table. You will get a high score because every centroid sits on a data point!

    Your model of centroids is completely useless because it is overfitted.

    Replies
    1. Not sure which of the above comments you are responding to. I don't think anyone is suggesting having a high number of clusters. In the applications I am most familiar with, hundreds of thousands or millions of customers are assigned to one of a handful of segments.

    2. Hey,

      I am replying to Henrique Alves's first post, explaining why the silhouette tends toward 1 when he sets k equal to the number of data points, in this case 60.

  6. Hey Michael,

    I am working on a customer clustering project with a marketing team. I was wondering whether it is necessary to use their domain knowledge to compare the results for different values of k, or even for different choices of variables, because it's hard to explain the results simply via such metrics.

    Thanks :-)

    Replies
    1. Yes. In a marketing context, I would describe clusters by how they differ according to metrics important to the client--revenue, tenure, purchase velocity, average order value, or whatever. It can also be useful to describe the average cluster member.

  7. Hello,

    I've been reading through Chapters 13 and 14 and find them to be easy to understand and very helpful. The one topic I was hoping to find, which doesn't seem to be covered, concerns how to treat categorical variables.

    My understanding is that most clustering methods rely on a distance metric, so only numeric variables should be included. Seeing that many common marketing variables are categorical (gender, ethnicity, and binned income), can you point me towards some best practices regarding how to handle these data types?

    Thanks!

  8. Chapter 13 (briefly) discusses k-medoids, which can be used directly with categorical variables.

    For simple categories such as gender, encoding them as "0" and "1" (and perhaps "0.5") is probably sufficient. When you have a lot of categorical values and want to look at groups of them, then Association Rules come into play (Chapter 15).

    One common technique for handling discrete values is to turn them into numeric flags and then do the clustering in the space of principal components. This could be considered a "best practice".

    Another approach is to use a distance function. You need a way to define the distance between two records, but it does not have to be Euclidean distance. Once upon a time, IBM had an algorithm they called "demographic clustering" that measured the distance between two records by the number of fields they have in common.

    SAS and other tools allow you to pass in a distance matrix for hierarchical clustering. This distance matrix can take categorical values into account.
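
    A minimal sketch of both ideas, assuming pandas and scipy are available; the column names and the choice of a matching-style (hamming) distance are just illustrative.

    import pandas as pd
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy categorical records, for illustration only.
    df = pd.DataFrame({"gender": ["F", "M", "F", "M"],
                       "income_bin": ["low", "high", "high", "low"]})

    # Option 1: turn the categories into 0/1 flags, which can then be fed
    # to k-means (optionally after projecting onto principal components).
    flags = pd.get_dummies(df)

    # Option 2: a matching-style distance -- the fraction of fields on which
    # two records disagree -- passed to hierarchical clustering.
    codes = df.apply(lambda col: col.astype("category").cat.codes)
    dist = pdist(codes.values.astype(float), metric="hamming")  # condensed distance matrix
    tree = linkage(dist, method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")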

