TL;DR
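This paper introduces a scalable hierarchical agglomerative clustering algorithm that scales to billions of data points without sacrificing quality. Under mild separability conditions, it provably recovers the optimal flat partition and gives a two-approximation to the DP-Means objective, and it reports state-of-the-art results on clustering benchmarks.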
Contribution
The authors develop a scalable, high-quality hierarchical clustering method with theoretical guarantees and demonstrate its effectiveness on large-scale datasets.
Findings
Achieves state-of-the-art results on clustering benchmarks.
Successfully applied to a dataset of 30 billion queries.
Provides a two-approximation to the non-parametric DP-Means objective under mild separability conditions.
Abstract
The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition, but also provide a two-approximation to the non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive…
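For readers unfamiliar with the DP-Means objective mentioned in the abstract, the sketch below evaluates it for a given flat partition, using the commonly stated form: the sum of squared distances from each point to its cluster mean, plus a penalty for every cluster used. This is only an illustration of the objective, not the paper's algorithm; the function name `dp_means_cost`, the parameter name `lam`, and the toy data are assumptions made for this example.

```python
import numpy as np


def dp_means_cost(points, labels, lam):
    """Evaluate the DP-Means objective for a flat partition (illustrative helper).

    cost = sum of squared distances to each cluster's mean + lam * (number of clusters)
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    within = 0.0
    for k in cluster_ids:
        members = points[labels == k]
        centroid = members.mean(axis=0)
        within += float(((members - centroid) ** 2).sum())
    return within + lam * len(cluster_ids)


# Toy data (hypothetical): two well-separated pairs of points.
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
two_clusters = dp_means_cost(toy_points, [0, 0, 1, 1], lam=1.0)
one_cluster = dp_means_cost(toy_points, [0, 0, 0, 0], lam=1.0)
print(two_clusters, one_cluster)
```

On this toy data the two-cluster partition has a lower cost than the single-cluster one, because the per-cluster penalty `lam` is small compared with the squared distances saved by splitting. Raising `lam` makes partitions with fewer clusters cheaper, which is how the objective chooses the number of clusters without fixing it in advance.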
