It's Hard to HAC with Average Linkage!
MohammadHossein Bateni, Laxman Dhulipala, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki

TL;DR
This paper establishes computational hardness results for average linkage hierarchical agglomerative clustering (HAC), showing that in general it is unlikely to admit efficient parallel algorithms or near-linear-time sequential algorithms, although efficient algorithms exist in special cases.
Contribution
It provides the first hardness bounds for average linkage HAC, proving limitations on both sequential and parallel algorithms, and identifies specific cases where efficient solutions are possible.
Findings
Sequential combinatorial algorithms have a runtime lower bound of n^{3/2 - ε} on n-node graphs, under standard fine-grained complexity assumptions.
Average linkage HAC is CC-hard on simple graphs like trees of diameter 4.
Efficient parallelization is possible for paths and small-height hierarchies.
Abstract
Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of n^{3/2 - ε} on n-node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter 4. On the possibility side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC)…
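For readers unfamiliar with the procedure whose complexity is studied here, the following is a minimal sketch of average linkage HAC on a weighted graph. It assumes the common graph-based definition, which the abstract does not spell out: the similarity of two clusters is the total weight of edges between them divided by the product of their sizes, and each step merges the most similar pair. The function name `average_linkage_hac` and its input format are hypothetical, and this naive version is far slower than the running times discussed above; it is meant only to illustrate the greedy, inherently sequential merge order.

```python
def average_linkage_hac(n, edges):
    """Naive average linkage HAC on a weighted graph with nodes 0..n-1.

    Assumed definition (not taken from the paper's text):
        sim(A, B) = (total edge weight between A and B) / (|A| * |B|)

    edges: dict mapping (u, v) node pairs to positive similarity weights.
    Returns the merges in order as (members_a, members_b, similarity).
    """
    # Each cluster id maps to the set of original nodes it contains.
    clusters = {i: frozenset([i]) for i in range(n)}

    # Total edge weight between each pair of current clusters,
    # keyed by (smaller_id, larger_id).
    weight = {}
    for (u, v), w in edges.items():
        key = (min(u, v), max(u, v))
        weight[key] = weight.get(key, 0.0) + w

    merges = []
    next_id = n
    while weight:
        # Find the pair of adjacent clusters with the highest average linkage.
        best = None
        for (a, b), w in weight.items():
            sim = w / (len(clusters[a]) * len(clusters[b]))
            if best is None or sim > best[0]:
                best = (sim, a, b)
        sim, a, b = best

        members_a = clusters.pop(a)
        members_b = clusters.pop(b)
        merges.append((sorted(members_a), sorted(members_b), sim))

        # Redirect edges of a and b to the new cluster, summing parallel edges.
        new_weight = {}
        for (x, y), w in weight.items():
            if (x, y) == (a, b):
                continue  # this edge is now internal to the merged cluster
            x2 = next_id if x in (a, b) else x
            y2 = next_id if y in (a, b) else y
            key = (min(x2, y2), max(x2, y2))
            new_weight[key] = new_weight.get(key, 0.0) + w
        weight = new_weight

        clusters[next_id] = members_a | members_b
        next_id += 1
    return merges


# Usage: a 3-node path 0 - 1 - 2 with edge weights 4 and 1.
merges = average_linkage_hac(3, {(0, 1): 4.0, (1, 2): 1.0})
for members_a, members_b, sim in merges:
    print(members_a, members_b, sim)
```

Each merge changes the similarities that later merges depend on, which gives some intuition for why the merge sequence is difficult to compute in parallel.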
