Blog & News

Articles, insights and thinking from software development vendor

# Memory reduction for average-link Hierarchical Clustering

Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Approaches to hierarchical clustering are typically reduced to two types:

• Agglomerative: this is a "bottom-up" approach, where each observation starts in its own cluster and the pairs of clusters are merged as one moves up the hierarchy.
• Divisive: this is a "top-down" approach, where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In this article, we will consider the usage of the Agglomerative type.

Generally, the merges are produced in a “greedy” manner. And the results of hierarchical clustering are usually presented in a dendrogram.

The complexity of agglomerative clustering is estimated as O(N3), which explains the low speed of working with large data sets. However, there can be a more efficient method of realization for some special cases.

Suppose that there is a set of N items to be clustered. Then, there will be an NxN distance matrix and the basic process of hierarchical clustering will consist of the next steps:

Each element is assigned to a separate cluster, i.e. the number of clusters will be initially equal to the number of elements. It is assumed that the distance between the clusters is the distance between the elements they contain.

• The clusters closest to each other are paired. They turn into one new cluster.
• The distance between the new and old clusters computes according to the described method.
• The second and third steps are repeated until there is a cluster containing all the elements.

We will consider an average-link clustering algorithm.

In average-link clustering, the distance between the clusters is determined by the average distance between their elements.

There is an existing average-link hierarchical clustering algorithm with the complexity of O(N2*log N), which requires storing O(N2 * (2 * sizeof(dataType) + sizeof(indexType)) elements. We used the same approach to improve the algorithm’s complexity, but this implementation requires storing only O(N2 * (sizeof(dataType) + 2 * sizeof(indexType))) elements, where size of(dataType) > sizeof(indexType).

Example #1

Abstract raw data before building a dendrogram. The hierarchical clustering dendrogram will look like this after the algorithm running: Algorithm improvements

The initial algorithm has some issues with large amounts of data. So we applied the next improvements to minimize the processing time and memory consumption.

1) All the information is stored only for unused elements (including those that were produced by combining). As the matrix of distances is symmetric, all calculations and operations are produced by its lower triangular shape.

```public float getCopheneticCorrelation(float[][] matrix) {

matrix = Utils.toTriangularMatrix(matrix);

Dendrogram dendrogram = getDendrogram(matrix);
…
}```
```activeRows = new TIntNSet(totalRowsCount);
for (int row = 0; row < rowsCount; ++row) {
}```

We deactivate two rows, and activate a new one, which is the result of merging the ones we’ve deactivated.

```…
deactivateRow(minRow);
deactivateRow(minRowColumn);
…```
```…
…
protected void deactivateRow(final int row) {
activeRows.remove(row);
tree.set(row, 0);
}

protected void activateRow(final int row) {
tree.set(row, 1);
}```

2) Because the used matrix is a lower triangular shape and sorted in each row elements, it becomes impossible to update the elements’ distance to the new cluster in O(1) without a memory increase. Therefore, we add a new option (that saves the layout of the elements after sorting) to avoid this.

```this.rowColumns = new TShortIntArray[totalRowsCount]; //columns before sorting
this.rowColumnIndexes = new TShortIntArray[totalRowsCount]; //column positions after sorting

for (int row = 0; row < rowsCount; ++row) {
Utils.DoubleValueIndexPair[] pairs = new Utils.DoubleValueIndexPair[row];
for (int column = 0; column < row; ++column) {
pairs[column] = new Utils.DoubleValueIndexPair(matrix[row][column], column);
}

Arrays.sort(pairs);

rowColumns[row] = new TShortIntArray(row, row);
rowColumnIndexes[row] = new TShortIntArray(row, row);

for (int index = 0; index < row; ++index) {
int column = pairs[index].index;
rowColumns[row].setValue(index, column);
rowColumnIndexes[row].setValue(column, index);
}
}```

3) We can use the information about the deactivated rows for getting the correct index of the needed column position in the third parameter. The initial complexity of this operation is O(N). However, we can reduce the complexity to O(log N) by using the segment tree structure.

4) However, the segment tree structure increases memory consumption. Each added cluster requires a separate version of a tree, which in its turn requires  O(N2) memory (thus up to O(N3) in general). Therefore, we added persistency to the segment tree structure for optimization. This approach will minimize memory consumption up to O(log N). It will be very significant when working with a large amount of data.

`this.tree = new TPersistentSumSegmentTree(states, maxChangesCount);`

We get the distance from both elements and then get an average distance to merge them.

```float minRowDistance = getDistance(minRow, currentRow);
float minColumnDistance = getDistance(minRowColumn, currentRow);

protected int getColumnIndex(int row, int column) {
if (row < rowsCount) {
return column;
}

int changeRootIndex = (row - rowsCount + 1) * 3;
int columnIndex = tree.getSum(changeRootIndex, 0, column) - 1;

return columnIndex;
}
```
```protected float getMergedDistance(int left, float leftDistance, int right, float rightDistance, int root) {
float rootDistance = leftDistance * sizes[left] + rightDistance * sizes[right];
rootDistance /= sizes[root];
return rootDistance;
}
```

Example #2

The next pages trace a hierarchical clustering of distances in miles between U.S. cities.

Input distance matrix:

 BOS NY DC MIA CHI SEA SF LA DEN BOS 0 NY 206 0 DC 429 233 0 MIA 1504 1308 1075 0 CHI 963 802 671 1329 0 SEA 2976 2815 2684 3273 2013 0 SF 3095 2934 2799 3053 2142 808 0 LA 2979 2786 2631 2687 2054 1131 379 0 DEN 1949 1771 1616 2037 996 1307 1235 1059 0

The nearest pair of cities is BOS and NY at a distance of 206. These are merged into a single cluster called "BOS/NY".

Then, we compute the distance from this new compound object to all other objects. The distance from "BOS/NY" to DC is 331 and it corresponds to the average distance from NY and BOS to DC. Similarly, the distance from "BOS/NY" to DEN will be 1,860.

After merging BOS with NY:

 BOS/NY DC MIA CHI SEA SF LA DEN BOS/NY 0 DC 331 0 MIA 1406 1075 0 CHI 882.5 671 1329 0 SEA 2895.5 2684 3273 2013 0 SF 3014.5 2799 3053 2142 808 0 LA 2882.5 2631 2687 2054 1131 379 0 DEN 1860 1616 2037 996 1307 1235 1059 0

The nearest pair of objects is BOS/NY and DC at a distance of 331. These are merged into a single cluster called "BOS/NY/DC". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:

After merging DC with BOS-NY:

 BOS/NY/DC MIA CHI SEA SF LA DEN BOS/NY/DC 0 MIA 1240.5 0 CHI 776.75 1329 0 SEA 2789.75 3273 2013 0 SF 2906.75 3053 2142 808 0 LA 2756.75 2687 2054 1131 379 0 DEN 1738 2037 996 1307 1235 1059 0

Now, the nearest pair of objects is SF and LA at a distance of 379. These are merged into a single cluster called "SF/LA". Then we compute the distance from this new cluster to all other objects to get a new distance matrix:

After merging SF with LA:

 BOS/NY/DC MIA CHI SEA SF/LA DEN BOS/NY/DC 0 MIA 1240.5 0 CHI 776.75 1329 0 SEA 2789.75 3273 2013 0 SF/LA 2831.75 2870 2098 969.5 0 DEN 1738 2037 996 1307 1147 0

Now, the nearest pair of objects is CHI and BOS/NY/DC at a distance of 776.75. These are merged into a single cluster called "BOS/NY/DC/CHI". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:

After merging CHI with BOS/NY/DC:

 BOS/NY/DC/CHI MIA SEA SF/LA DEN BOS/NY/DC/CHI 0 MIA 1284.75 0 SEA 2401.375 3273 0 SF/LA 2464.875 2687 808 0 DEN 1367 2037 1307 1147 0

Now the nearest pair of objects is SEA and SF/LA at distance 808. These are merged into a single cluster called "SF/LA/SEA". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:

After merging SEA with SF/LA:

 BOS/NY/DC/CHI MIA SF/LA/SEA DEN BOS/NY/DC/CHI 0 MIA 1284.75 0 SF/LA/SEA 2433.125 2980 0 DEN 1367 2037 1227 0

Now, the nearest pair of objects is DEN and SF/LA/SEA at a distance of 1,227. These are merged into a single cluster called "SF/LA/SEA/DEN". Then, we compute the distance from this new cluster to all other clusters to get a new distance matrix:

After merging DEN with SF/LA/SEA:

 BOS/NY/DC/CHI MIA SF/LA/SEA/DEN BOS/NY/DC/CHI 0 MIA 1284.75 0 SF/LA/SEA/DEN 1900.063 2508.5 0

Now the nearest pair of objects is BOS/NY/DC/CHI and MIA at a distance of 1,284.75. These are merged into a single cluster called "BOS/NY/DC/CHI/MIA". Then, we compute the distance from this new compound object to all other objects to get a new distance matrix:

After merging BOS/NY/DC/CHI with MIA:

 BOS/NY/DC/CHI/MIA SF/LA/SEA/DEN BOS/NY/DC/CHI/MIA 0 SF/LA/SEA/DEN 1896.625 0

Finally, we merge the last two clusters at level 1896.625.

• The original hierarchical clustering algorithm
• Algorithm modifications, decreasing the algorithm’s complexity from O(N3) up to O(N2 log N)
• Modifications that allow for decreasing the amount of used memory from O(N3) up to O(N2 log N) without increasing the algorithm’s complexity up to the original O(N3)
• Example of a general algorithm

References

• D'andrade,R. 1978, "U-Statistic Hierarchical Clustering" Psychometrika, 4:58-67.
• Johnson,S.C. 1967, "Hierarchical Clustering Schemes" Psychometrika, 2:241-254.
• Kaufman, L.; Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis (1 ed.). New York: John Wiley. ISBN 0-471-87876-6.
• http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html

If you need development assistance in a similar project or you have questions about the algorithm implementation, please write to us at hello@wave-access.com and our experts will help you.

### Order a phone call

Convenient time to call:

Cancel

### Get started

Attach
Your file up to 30 mb
Cancel
We use cookies to make your experience of our website better and assist in our marketing efforts. By continuing to browse or by clicking “Accept” you agree to our use of cookies as described in our Privacy Policy.