Comments on Data Miners Blog: Cluster Silhouettes
Michael J. A. Berry (http://www.blogger.com/profile/06077102677195066016), feed updated 2020-01-27T00:50:40-05:00, 11 comments

Gordon S. Linoff (https://www.blogger.com/profile/02341184075032239786), 2014-03-02T15:46:52-05:00

Chapter 13 (briefly) discusses k-medoids, which can be used directly.

For simple categories such as gender, encoding them as "0", "1" (and perhaps "0.5") is probably sufficient. When you have a lot of categorical values and want to look at groups of them, then Association Rules come into play (Chapter 15).

One common technique for handling discrete values is to turn them into numeric flags and then do the clustering in the space of principal components. This could be considered a "best practice".

Another approach is to use a distance function. You need a way to define the distance between two records, but it does not have to be Euclidean distance. Once upon a time, IBM had an algorithm they called "demographic clustering" that measured the distance between two records by the number of fields they have in common.

SAS and other tools allow you to pass in a distance matrix for hierarchical clustering. This distance matrix can take categorical values into account.

Anonymous, 2014-02-07T19:26:20-05:00

Hello,

I've been reading through Chapters 13 and 14 and find them to be easy to understand and very helpful. The one topic I was hoping to find, which doesn't seem to be covered, concerns how to treat categorical variables.

My understanding is that most clustering methods rely on a distance metric, so only numeric variables should be included. Seeing that many common marketing variables are categorical (gender, ethnicity, and binned income), can you point me towards some best practices regarding how to handle these data types?

Thanks!

Michael J. A. Berry, 2013-08-28T20:46:20-04:00

Yes. In a marketing context, I would describe clusters by how they differ according to metrics important to the client: revenue, tenure, purchase velocity, average order value, or whatever. It can also be useful to describe the average cluster member.

Aaron Chen (https://www.blogger.com/profile/14296428287147475729), 2013-08-13T22:53:48-04:00

Hey Michael,

I am working on a customer clustering project with the marketing team. I was wondering whether it is necessary to use their domain knowledge to compare the results for different K, or even using different values, because it's hard to explain the results simply via such metrics.

Thanks :-)

Anonymous, 2012-08-22T10:52:05-04:00

Hey,

I am replying to Henrique Alves's first post, about what happens when he sets K equal to the number of data points, in this case 60.

Michael J. A. Berry, 2012-08-21T18:39:09-04:00

Not sure which of the above comments you are responding to. I don't think anyone is suggesting having a high number of clusters. In the applications I am most familiar with, hundreds of thousands or millions of customers are assigned to one of a handful of segments.

Anonymous, 2012-08-20T18:56:57-04:00

@ Henrique Alves:

When you have as many centroids as data points, what you have done is essentially create a lookup table. You will get a high result because every centroid sits on a data point!

Your model of centroids is completely useless because it is overfitted.

Richard Morris (https://www.blogger.com/profile/15067268904290602052), 2011-04-13T22:56:03-04:00

Without having looked at your new book, I do find the concept of cluster metrics an interesting one. However, I don't really see much diversity in the way of cluster metrics. Most fields seem to fall into some sort of local minimum where they consistently reuse the same (and sometimes flawed) metric. This comment might just be flaunting my ignorance, but do you see that problem as well, or is there more diversity in the world of clustering metrics than I have seen?

I would also ask, how does this metric compare to others that are used? Does it compare favorably? How often would you guess that it is used relative to other clustering metrics?

Thanks for your time.

Henrique Alves (https://www.blogger.com/profile/10314455414137959434), 2011-04-02T15:26:11-04:00

Thank you very much for your help :)

Michael J. A. Berry, 2011-04-02T14:30:25-04:00

In general, there is no reason to assume there is a natural value for k. Often the value for k is dictated by the needs of the application.

What I meant by "acceptable range" is that if you are creating customer segments and you are willing to support as few as 3 or as many as 6, then you could choose based on which value for k yields clusters with around the same number of members and a good silhouette value. I always assume that the number of clusters is much smaller than the number of data points.

Henrique Alves, 2011-04-02T10:32:34-04:00

Hello,

when you say:

"The silhouette can be used to choose an appropriate value for k in k-means by trying each value of k in the acceptable range and choosing the one that yields the best silhouette."

What do you mean by "the acceptable range"?

I'm trying to find out the natural k for a data set of 60 objects. I calculate the dataset's silhouette for all possible values of k (1 -> 60). All I see is the silhouette values tending to 1 when k tends to 60, which I think makes sense... I was expecting some peak values to choose a natural k from, but I found many peaks, and all of them tend to grow as we approach k=60.

What do you think about that?

Thank you for your time :)
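
A minimal sketch of the procedure discussed in this thread: score each k in a small, application-driven range by mean silhouette and keep the best one. Everything concrete here is an assumption made for illustration and does not come from the post or the book: the toy data, the range 2 to 6, the helper names `kmeans` and `mean_silhouette`, the farthest-first seeding, and the convention that a point alone in its cluster scores 0.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means with deterministic farthest-first seeding (illustrative)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (made up): three well-separated groups of five points each.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
blob_centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

acceptable_k = range(2, 7)  # the range the application is willing to support
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in acceptable_k}
best_k = max(scores, key=scores.get)

for k in acceptable_k:
    print(k, round(scores[k], 3))
print("best k:", best_k)
```

On this toy data the sketch selects k = 3. Keeping the search inside a small range also avoids the degenerate case raised in the thread, where k equals the number of data points and every point is its own cluster.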