Agglomerative clustering assigns records to clusters, starting with the records that are closest to each other. The process is repeated until all records are placed into a single cluster. The advantage of agglomerative clustering is that it creates a structure for the records, so the user can see the results for different numbers of clusters. Divisive clustering, such as that implemented by SAS's PROC VARCLUS, produces something similar, but from the top down.
Agglomerative variable clustering works the same way. Two variables are put into the same cluster, based on their proximity. The cluster then needs to be defined in some manner, by combining information in the cluster.
The natural measure for proximity is the square of the (Pearson) correlation between the variables. This is a value between 0 and 1, where 0 means the variables are totally uncorrelated and 1 means they are collinear. For those who are more graphically inclined, this statistic has an easy interpretation when there are two variables: it is the R-square value of the least-squares line fitted to the scatter plot.
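As an illustration of this proximity measure, here is a minimal sketch in Python (standard library only, used here in place of the SAS that the post is written around). The function name `squared_correlation` and the sample columns are hypothetical, chosen only for the example.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of the centered values.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical sample columns.
base = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]      # exact multiple of base: collinear
reversed_ = [4.0, 3.0, 2.0, 1.0]    # perfectly negatively correlated
zigzag = [1.0, -1.0, -1.0, 1.0]     # uncorrelated with base

print(squared_correlation(base, doubled))
print(squared_correlation(base, reversed_))
print(squared_correlation(base, zigzag))
```

Because the correlation is squared, a strong negative relationship counts as just as "close" as a strong positive one, which is what you want when grouping variables that carry the same information.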
Combining two variables into a cluster requires creating a single variable to represent the cluster. The natural variable for this is the first principal component.
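For two variables, the first principal component has a simple closed form, which the following Python sketch uses. It assumes both variables are standardized first (that is, the components are taken from the correlation matrix rather than the covariance matrix); under that assumption the first component is the sum of the two standardized variables (or their difference, when the correlation is negative) divided by the square root of 2. The function names are hypothetical, and this is an illustration rather than the post's SAS code.

```python
import math


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def first_principal_component(x, y):
    """Return (scores, explained) for the first component of two variables.

    Assumes PCA on the correlation matrix [[1, r], [r, 1]], whose leading
    eigenvector is (1, sign(r)) / sqrt(2) with eigenvalue 1 + |r|.
    """
    zx, zy = standardize(x), standardize(y)
    n = len(zx)
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
    sign = 1.0 if r >= 0 else -1.0
    scores = [(a + sign * b) / math.sqrt(2) for a, b in zip(zx, zy)]
    explained = (1 + abs(r)) / 2  # share of the total variance (which is 2)
    return scores, explained


# Hypothetical sample columns.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
scores, explained = first_principal_component(x, y)
print(scores, explained)
```

The more correlated the two variables are, the closer `explained` gets to 1, meaning that less information is lost by replacing the pair with the single component.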
My proposed clustering method repeatedly does the following:
- Finds the two variables with the highest correlation.
- Calculates the first principal component for these variables and adds it into the data.
- Maintains the information that the two variables have been combined.
proc sql;
    select colname
    from columns
    where counter <= [some number] <>
These variables can then be used for predictive models or visualization purposes.
The inner loop of the code works by doing the following:
- Calling proc corr to calculate the correlation of all variables not already in a cluster.
- Using proc transpose to reshape the correlations into a table with three columns: two for the variable names and one for the correlation.
- Finding the pair of variables with the largest correlation.
- Calculating the first principal component for these variables.
- Appending this principal component to the data set.
- Updating the columns data set with information about the new cluster.
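The steps above can be sketched end to end in Python (standard library only; this is an illustration, not the SAS macro the post describes). It assumes the variables are standardized, that "largest correlation" means largest squared correlation, and that no column is constant. The names `cluster_variables`, `cluster1`, `cluster2`, and the sample data are hypothetical.

```python
import math
from itertools import combinations


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlation(zx, zy):
    """Pearson correlation of two already-standardized columns."""
    return sum(a * b for a, b in zip(zx, zy)) / (len(zx) - 1)


def cluster_variables(data):
    """data maps variable name -> list of values; returns the merge history."""
    # Variables not yet absorbed into a cluster.
    active = {name: standardize(col) for name, col in data.items()}
    merges = []
    while len(active) > 1:
        # Correlations in "long" form: (variable, variable, correlation).
        pairs = [(a, b, correlation(active[a], active[b]))
                 for a, b in combinations(active, 2)]
        # Pick the pair with the largest squared correlation.
        a, b, r = max(pairs, key=lambda p: p[2] ** 2)
        # First principal component of the two standardized variables.
        sign = 1.0 if r >= 0 else -1.0
        pc = [(u + sign * v) / math.sqrt(2)
              for u, v in zip(active[a], active[b])]
        # Replace the pair with the new component and record the merge.
        new_name = f"cluster{len(merges) + 1}"
        del active[a], active[b]
        active[new_name] = standardize(pc)
        merges.append((new_name, a, b, r * r))
    return merges


# Hypothetical sample data: a and b are nearly collinear, c is not.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 11.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
merges = cluster_variables(data)
for step in merges:
    print(step)
```

Each entry in `merges` records the new cluster's name, the two members that were combined, and their squared correlation, which plays roughly the role of the bookkeeping kept in the columns data set.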
You might be interested in reading:
A data-driven functional projection approach for the selection of feature ranges in spectra with ICA or cluster analysis
Catherine Krier, Fabrice Rossi, Damien François and Michel Verleysen
Chemometrics and Intelligent Laboratory Systems, Elsevier, Vol. 91, No. 1 (15 March 2008), pp. 43-53.
http://www.dice.ucl.ac.be/~verleyse/papers/cils08ck.pdf
and the references therein
Sas Proc VarClus
PROC VARCLUS is specifically my inspiration for thinking about an agglomerative approach to clustering variables. PROC VARCLUS implements various divisive methods, where all the variables are included in a single cluster, and this gets broken into smaller clusters. I realize that when I first wrote this post, I used the term "hierarchical" when I should have used the term "agglomerative"; I have since fixed that.
If your goal is to reduce the number of features, I don't see why you are using the first principal component. Although you use only the first component, it is a linear combination of the two variables, so in actual fact you have not reduced the number of variables.
How is this an improvement over performing principal component analysis on all the variables and keeping only the first n principal components?