Hierarchical clustering (also called hierarchical cluster analysis, or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Approaches to hierarchical clustering generally fall into two types:
 Agglomerative: a "bottom-up" approach, where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy.
 Divisive: a "top-down" approach, where all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.
In this article, we will consider the agglomerative type.
Generally, the merges are performed in a greedy manner, and the results of hierarchical clustering are usually presented in a dendrogram.
The complexity of agglomerative clustering is O(N^{3}) in the general case, which makes it slow on large data sets. However, more efficient implementations exist for some special cases.
Suppose that there is a set of N items to be clustered. Then there will be an N×N distance matrix, and the basic process of hierarchical clustering consists of the following steps:
 Each element is assigned to a separate cluster, i.e. the number of clusters is initially equal to the number of elements. The distance between two clusters is taken to be the distance between the elements they contain.
 The two clusters closest to each other are merged into one new cluster.
 The distances between the new cluster and the old clusters are computed according to the chosen method.
 The second and third steps are repeated until a single cluster contains all the elements.
 The third step can be implemented in several ways: single-link, complete-link, and average-link clustering.
We will consider an average-link clustering algorithm.
In average-link clustering, the distance between two clusters is determined by the average distance between their elements.
There is an existing average-link hierarchical clustering algorithm with a complexity of O(N^{2} log N), which requires storing O(N^{2} * (2 * sizeof(dataType) + sizeof(indexType))) elements. We used the same approach to improve the algorithm’s complexity, but our implementation requires storing only O(N^{2} * (sizeof(dataType) + 2 * sizeof(indexType))) elements, where sizeof(dataType) > sizeof(indexType).
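The basic procedure described above can be sketched as a naive O(N^{3}) reference implementation, using the simple pairwise averaging shown in Example #2 below. All class and method names here are illustrative; this is not the optimized implementation discussed in this article:

```java
// Naive average-link agglomerative clustering: O(N^3) time, O(N^2) memory.
// Illustrative sketch only -- names do not match the article's actual classes.
import java.util.ArrayList;
import java.util.List;

public class NaiveAverageLink {

    /** Returns the merge levels; dist is a full symmetric N x N distance matrix. */
    public static List<Double> cluster(double[][] dist) {
        int n = dist.length;
        List<Integer> active = new ArrayList<>();
        for (int i = 0; i < n; ++i) active.add(i);
        List<Double> levels = new ArrayList<>();
        while (active.size() > 1) {
            // Step 2: find the closest pair of active clusters.
            int bi = 0, bj = 1;
            for (int i = 0; i < active.size(); ++i)
                for (int j = i + 1; j < active.size(); ++j)
                    if (dist[active.get(i)][active.get(j)]
                            < dist[active.get(bi)][active.get(bj)]) { bi = i; bj = j; }
            int a = active.get(bi), b = active.get(bj);
            levels.add(dist[a][b]);
            // Step 3: distance to the merged cluster = average of distances to its parts.
            for (int k : active) {
                if (k == a || k == b) continue;
                double d = (dist[a][k] + dist[b][k]) / 2.0;
                dist[a][k] = d;
                dist[k][a] = d;        // row a now represents the merged cluster
            }
            active.remove(bj);         // row b becomes inactive
        }
        return levels;
    }
}
```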
Example #1
Abstract raw data before building a dendrogram.
The hierarchical clustering dendrogram will look like this after running the algorithm:
Algorithm improvements
The initial algorithm has some issues with large amounts of data, so we applied the following improvements to minimize processing time and memory consumption.
1) All the information is stored only for active elements (including those produced by merging). Since the distance matrix is symmetric, all calculations and operations are performed on its lower triangular part.
public float getCopheneticCorrelation(float[][] matrix) {
    matrix = Utils.toTriangularMatrix(matrix);
    Dendrogram dendrogram = getDendrogram(matrix);
    …
}
activeRows = new TIntNSet(totalRowsCount);
for (int row = 0; row < rowsCount; ++row) {
    activeRows.addLast(row);
}
We deactivate two rows, and activate a new one, which is the result of merging the ones we’ve deactivated.
…
deactivateRow(minRow);
deactivateRow(minRowColumn);
…
activateRow(addedRow);
…

protected void deactivateRow(final int row) {
    activeRows.remove(row);
    tree.set(row, 0);
}

protected void activateRow(final int row) {
    activeRows.addLast(row);
    tree.set(row, 1);
}
2) Because the matrix is stored as a lower triangular part with the elements of each row sorted, it becomes impossible to update the elements’ distances to the new cluster in O(1) without increasing memory. Therefore, we store additional mappings that preserve the layout of the elements after sorting.
this.rowColumns = new TShortIntArray[totalRowsCount];       // columns before sorting
this.rowColumnIndexes = new TShortIntArray[totalRowsCount]; // column positions after sorting
for (int row = 0; row < rowsCount; ++row) {
    Utils.DoubleValueIndexPair[] pairs = new Utils.DoubleValueIndexPair[row];
    for (int column = 0; column < row; ++column) {
        pairs[column] = new Utils.DoubleValueIndexPair(matrix[row][column], column);
    }
    Arrays.sort(pairs);
    rowColumns[row] = new TShortIntArray(row, row);
    rowColumnIndexes[row] = new TShortIntArray(row, row);
    for (int index = 0; index < row; ++index) {
        int column = pairs[index].index;
        rowColumns[row].setValue(index, column);
        rowColumnIndexes[row].setValue(column, index);
    }
}
3) We can use the information about the deactivated rows to get the correct index of the needed column position. The initial complexity of this operation is O(N); however, it can be reduced to O(log N) by using a segment tree structure.
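The idea can be illustrated with a minimal (non-persistent) sum segment tree over activity flags: each row stores 1 while active and 0 after deactivation, and a prefix sum over the flags converts an original column number into its position among the still-active rows in O(log N). This is an illustrative sketch, not the article’s actual tree class:

```java
// Minimal sum segment tree over activity flags (1 = active row, 0 = deactivated).
// prefixSum(c) gives the number of active rows up to and including position c,
// i.e. the rank of column c among the still-active rows.
public class SumSegmentTree {
    private final int size;
    private final int[] tree;          // 1-based heap layout

    public SumSegmentTree(int n) {
        size = n;
        tree = new int[4 * n];
    }

    public void set(int pos, int value) { set(1, 0, size - 1, pos, value); }

    private void set(int node, int lo, int hi, int pos, int value) {
        if (lo == hi) { tree[node] = value; return; }
        int mid = (lo + hi) / 2;
        if (pos <= mid) set(2 * node, lo, mid, pos, value);
        else set(2 * node + 1, mid + 1, hi, pos, value);
        tree[node] = tree[2 * node] + tree[2 * node + 1];
    }

    /** Sum of flags on [0, pos]. */
    public int prefixSum(int pos) { return sum(1, 0, size - 1, 0, pos); }

    private int sum(int node, int lo, int hi, int from, int to) {
        if (to < lo || hi < from) return 0;
        if (from <= lo && hi <= to) return tree[node];
        int mid = (lo + hi) / 2;
        return sum(2 * node, lo, mid, from, to) + sum(2 * node + 1, mid + 1, hi, from, to);
    }
}
```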
4) However, the segment tree structure increases memory consumption: each added cluster requires a separate version of the tree, which in turn requires O(N^{2}) memory (thus up to O(N^{3}) in total). Therefore, we made the segment tree persistent, so that each new version adds only O(log N) memory. This is very significant when working with large amounts of data.
this.tree = new TPersistentSumSegmentTree(states, maxChangesCount);
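Persistence is typically achieved by path copying: an update recreates only the O(log N) nodes on the path from the root to the changed leaf and returns a new root, so every earlier version remains queryable. The sketch below illustrates this idea; it is not the actual TPersistentSumSegmentTree code, and all names are illustrative:

```java
// Path-copying persistent sum segment tree: each update allocates O(log N)
// new nodes and yields a new version; old versions stay intact and queryable.
import java.util.ArrayList;
import java.util.List;

public class PersistentSumTree {
    private static final class Node {
        final int sum; final Node left, right;
        Node(int sum, Node left, Node right) { this.sum = sum; this.left = left; this.right = right; }
    }

    private final int size;
    private final List<Node> roots = new ArrayList<>();   // one root per version

    public PersistentSumTree(int[] initial) {
        size = initial.length;
        roots.add(build(initial, 0, size - 1));           // version 0
    }

    private Node build(int[] a, int lo, int hi) {
        if (lo == hi) return new Node(a[lo], null, null);
        int mid = (lo + hi) / 2;
        Node l = build(a, lo, mid), r = build(a, mid + 1, hi);
        return new Node(l.sum + r.sum, l, r);
    }

    /** Sets position pos to value in the latest version; returns the new version id. */
    public int set(int pos, int value) {
        roots.add(set(roots.get(roots.size() - 1), 0, size - 1, pos, value));
        return roots.size() - 1;
    }

    private Node set(Node node, int lo, int hi, int pos, int value) {
        if (lo == hi) return new Node(value, null, null);
        int mid = (lo + hi) / 2;
        Node l = node.left, r = node.right;
        if (pos <= mid) l = set(l, lo, mid, pos, value);  // copy only the touched path
        else r = set(r, mid + 1, hi, pos, value);
        return new Node(l.sum + r.sum, l, r);
    }

    /** Sum of [0, pos] in the given version. */
    public int prefixSum(int version, int pos) { return sum(roots.get(version), 0, size - 1, 0, pos); }

    private int sum(Node node, int lo, int hi, int from, int to) {
        if (node == null || to < lo || hi < from) return 0;
        if (from <= lo && hi <= to) return node.sum;
        int mid = (lo + hi) / 2;
        return sum(node.left, lo, mid, from, to) + sum(node.right, mid + 1, hi, from, to);
    }
}
```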
We get the distance from both elements and then get an average distance to merge them.
float minRowDistance = getDistance(minRow, currentRow);
float minColumnDistance = getDistance(minRowColumn, currentRow);
float addedRowDistance = getMergedDistance(minRow, minRowDistance, minRowColumn, minColumnDistance, addedRow);

protected int getColumnIndex(int row, int column) {
    if (row < rowsCount) {
        return column;
    }
    int changeRootIndex = (row - rowsCount + 1) * 3;
    int columnIndex = tree.getSum(changeRootIndex, 0, column) - 1;
    return columnIndex;
}
protected float getMergedDistance(int left, float leftDistance, int right, float rightDistance, int root) {
    float rootDistance = leftDistance * sizes[left] + rightDistance * sizes[right];
    rootDistance /= sizes[root];
    return rootDistance;
}
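As a worked check of this formula with data from Example #2 below: when BOS and NY (cluster sizes 1 and 1) are merged, the distance from the new cluster to CHI is (963 * 1 + 802 * 1) / 2 = 882.5. A standalone version of the same computation, assuming the merged cluster's size is the sum of its parts (names here are illustrative):

```java
public class MergedDistanceExample {
    // Same weighted average as getMergedDistance above, written as a
    // standalone helper for illustration.
    static float mergedDistance(float leftDistance, int leftSize, float rightDistance, int rightSize) {
        return (leftDistance * leftSize + rightDistance * rightSize) / (leftSize + rightSize);
    }

    public static void main(String[] args) {
        // Distance from the merged "BOS/NY" cluster to CHI: BOS-CHI = 963, NY-CHI = 802.
        System.out.println(mergedDistance(963f, 1, 802f, 1)); // 882.5
    }
}
```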
Example #2
The following example traces a hierarchical clustering of distances in miles between U.S. cities.
Input distance matrix:

     BOS   NY    DC    MIA   CHI   SEA   SF    LA    DEN
BOS  0
NY   206   0
DC   429   233   0
MIA  1504  1308  1075  0
CHI  963   802   671   1329  0
SEA  2976  2815  2684  3273  2013  0
SF   3095  2934  2799  3053  2142  808   0
LA   2979  2786  2631  2687  2054  1131  379   0
DEN  1949  1771  1616  2037  996   1307  1235  1059  0
The nearest pair of cities is BOS and NY at a distance of 206. These are merged into a single cluster called "BOS/NY".
Then, we compute the distance from this new compound object to all other objects. The distance from "BOS/NY" to DC is 331 and it corresponds to the average distance from NY and BOS to DC. Similarly, the distance from "BOS/NY" to DEN will be 1,860.
After merging BOS with NY:

        BOS/NY  DC    MIA   CHI   SEA   SF    LA    DEN
BOS/NY  0
DC      331     0
MIA     1406    1075  0
CHI     882.5   671   1329  0
SEA     2895.5  2684  3273  2013  0
SF      3014.5  2799  3053  2142  808   0
LA      2882.5  2631  2687  2054  1131  379   0
DEN     1860    1616  2037  996   1307  1235  1059  0
The nearest pair of objects is BOS/NY and DC at a distance of 331. These are merged into a single cluster called "BOS/NY/DC". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:
After merging DC with BOS/NY:

           BOS/NY/DC  MIA   CHI   SEA   SF    LA    DEN
BOS/NY/DC  0
MIA        1240.5     0
CHI        776.75     1329  0
SEA        2789.75    3273  2013  0
SF         2906.75    3053  2142  808   0
LA         2756.75    2687  2054  1131  379   0
DEN        1738       2037  996   1307  1235  1059  0
Now, the nearest pair of objects is SF and LA at a distance of 379. These are merged into a single cluster called "SF/LA". Then we compute the distance from this new cluster to all other objects to get a new distance matrix:
After merging SF with LA:

           BOS/NY/DC  MIA   CHI   SEA    SF/LA  DEN
BOS/NY/DC  0
MIA        1240.5     0
CHI        776.75     1329  0
SEA        2789.75    3273  2013   0
SF/LA      2831.75    2870  2098   969.5  0
DEN        1738       2037  996    1307   1147  0
Now, the nearest pair of objects is CHI and BOS/NY/DC at a distance of 776.75. These are merged into a single cluster called "BOS/NY/DC/CHI". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:
After merging CHI with BOS/NY/DC:

               BOS/NY/DC/CHI  MIA   SEA    SF/LA  DEN
BOS/NY/DC/CHI  0
MIA            1284.75        0
SEA            2401.375       3273  0
SF/LA          2464.875       2870  969.5  0
DEN            1367           2037  1307   1147   0

Now, the nearest pair of objects is SEA and SF/LA at a distance of 969.5. These are merged into a single cluster called "SF/LA/SEA". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:
After merging SEA with SF/LA:

               BOS/NY/DC/CHI  MIA     SF/LA/SEA  DEN
BOS/NY/DC/CHI  0
MIA            1284.75        0
SF/LA/SEA      2433.125       3071.5  0
DEN            1367           2037    1227       0

Now, the nearest pair of objects is DEN and SF/LA/SEA at a distance of 1,227. These are merged into a single cluster called "SF/LA/SEA/DEN". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:
After merging DEN with SF/LA/SEA:

               BOS/NY/DC/CHI  MIA      SF/LA/SEA/DEN
BOS/NY/DC/CHI  0
MIA            1284.75        0
SF/LA/SEA/DEN  1900.063       2554.25  0

Now, the nearest pair of objects is BOS/NY/DC/CHI and MIA at a distance of 1,284.75. These are merged into a single cluster called "BOS/NY/DC/CHI/MIA". Then, we compute the distance from this new cluster to the remaining cluster to get the final distance matrix:
After merging BOS/NY/DC/CHI with MIA:

                   BOS/NY/DC/CHI/MIA  SF/LA/SEA/DEN
BOS/NY/DC/CHI/MIA  0
SF/LA/SEA/DEN      2227.156           0

Finally, we merge the last two clusters at level 2227.156.
In this article, we reviewed the following aspects:
 The original hierarchical clustering algorithm
 Algorithm modifications that decrease the algorithm’s complexity from O(N^{3}) to O(N^{2} log N)
 Modifications that decrease the amount of used memory from O(N^{3}) to O(N^{2} log N) without increasing the algorithm’s complexity back to the original O(N^{3})
 Example of a general algorithm
References
 D'Andrade, R. 1978, "U-Statistic Hierarchical Clustering", Psychometrika, 4:58-67.
 Johnson, S.C. 1967, "Hierarchical Clustering Schemes", Psychometrika, 2:241-254.
 Kaufman, L.; Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis (1st ed.). New York: John Wiley. ISBN 0-471-87876-6.
 http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html
If you need development assistance in a similar project or you have questions about the algorithm implementation, please write to us at hello@waveaccess.com and our experts will help you.