Hierarchical clustering with dot products recovers hidden tree structure
Annie Gray, Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley

TL;DR
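This paper introduces a simple variant of agglomerative hierarchical clustering in which clusters are merged by maximum average dot product. The authors show that the resulting tree estimates the generative hierarchical structure in data under a probabilistic graphical model, and they report better tree recovery on real data than UPGMA, Ward's method, and HDBSCAN.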
Contribution
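The paper proposes a variant of agglomerative clustering that merges clusters by maximum average dot product rather than by minimum distance or within-cluster variance, and it provides theoretical guarantees for recovering the true hierarchical structure under a generic probabilistic graphical model.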
Findings
The dot product-based clustering accurately recovers hierarchical structure in synthetic data.
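The method outperforms UPGMA, Ward's method, and HDBSCAN at tree recovery on real datasets.
Theoretical analysis shows how hierarchical information in the model translates into tree geometry that can be recovered from data, and characterises the benefits of growing sample size and data dimension together.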
Abstract
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
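Sketch
The merge rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dot_product_linkage`, the merge-list output format, and the brute-force pair search are assumptions made for illustration. It also assumes that the "average dot product" between two clusters is the mean of the dot products over all cross-cluster pairs of points, which equals the dot product of the two cluster sums divided by the product of the cluster sizes.

```python
import numpy as np

def dot_product_linkage(X):
    """Agglomerative clustering that repeatedly merges the two clusters
    with the largest average dot product (illustrative sketch only).

    Returns a list of (cluster_a, cluster_b, score) merges. Original points
    have ids 0..n-1; each merge creates a new cluster id n, n+1, ...
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Track each cluster's vector sum and size. The mean of all cross-cluster
    # dot products between A and B is (sum_A . sum_B) / (|A| * |B|).
    sums = {i: X[i].copy() for i in range(n)}
    sizes = {i: 1 for i in range(n)}
    merges = []
    next_id = n
    while len(sums) > 1:
        ids = sorted(sums)
        best = None
        # Brute-force search over all current cluster pairs.
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                score = float(sums[a] @ sums[b]) / (sizes[a] * sizes[b])
                if best is None or score > best[2]:
                    best = (a, b, score)
        a, b, score = best
        merges.append((a, b, score))
        # The merged cluster's sum and size are the sums of its children's.
        sums[next_id] = sums.pop(a) + sums.pop(b)
        sizes[next_id] = sizes.pop(a) + sizes.pop(b)
        next_id += 1
    return merges

if __name__ == "__main__":
    # Toy data (made up for this sketch): two groups pointing in different directions.
    points = [[2.0, 0.0], [2.0, 0.2], [0.0, 1.0], [0.1, 1.0]]
    for a, b, score in dot_product_linkage(points):
        print(a, b, round(score, 3))
```

The brute-force search keeps the sketch short; it recomputes every pairwise score at each step, so it is only suitable for small inputs.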
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Bayesian Methods and Mixture Models
