Scalable Hierarchical Agglomerative Clustering

Nicholas Monath; Avinava Dubey; Guru Guruganesh; Manzil Zaheer; Amr Ahmed; Andrew McCallum; Gokhan Mergen; Marc Najork; Mert Terzihan; Bryon Tjanaka; Yuan Wang; Yuchen Wu

arXiv:2010.11821 · cs.LG · October 1, 2021

TL;DR

This paper introduces a hierarchical agglomerative clustering algorithm that scales to billions of data points without sacrificing quality. Under mild separability conditions it recovers the optimal flat partition and gives a two-approximation to the DP-Means objective, and it outperforms existing scalable methods on clustering benchmarks.

Contribution

The authors develop a scalable hierarchical clustering method that avoids the over-merging seen in existing scalable approaches, back it with theoretical guarantees under mild separability conditions, and demonstrate it on datasets of up to 30 billion queries.

Findings

01. Achieves state-of-the-art results on clustering benchmarks.

02. Successfully applied to a dataset of 30 billion queries.

03. Provides a two-approximation to the DP-Means objective under mild conditions.

Abstract

The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition, but also provide a two-approximation to the non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive…
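
To make the DP-Means guarantee concrete, the sketch below evaluates the objective in its standard form: the sum of squared Euclidean distances from each point to its cluster mean, plus a penalty λ for every cluster used. A two-approximation means the returned partition costs at most twice the optimum of this quantity. The function name `dp_means_cost`, the parameter `lam`, and the toy data are illustrative assumptions, not taken from the paper or its repositories, and the code only scores a given partition; it does not implement the authors' algorithm.

```python
import numpy as np


def dp_means_cost(points, labels, lam):
    """Score a flat partition under the standard DP-Means objective.

    cost = sum over clusters of squared distances to the cluster mean
           + lam * (number of clusters)

    Hypothetical helper for illustration only.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)

    within = 0.0
    for c in cluster_ids:
        members = points[labels == c]
        center = members.mean(axis=0)  # cluster mean minimizes squared distance
        within += ((members - center) ** 2).sum()

    # Each cluster pays a fixed penalty, so the number of clusters is not fixed in advance.
    return within + lam * len(cluster_ids)


# Toy data (assumed): two well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

two_clusters = dp_means_cost(points, [0, 0, 1, 1], lam=4.0)
one_cluster = dp_means_cost(points, [0, 0, 0, 0], lam=4.0)

print(two_clusters)  # 12.0
print(one_cluster)   # 108.0
```

With λ = 4, splitting the toy data into its two natural groups costs 12, while merging everything into one cluster costs 108. The penalty term is what lets the objective choose the number of clusters: a larger λ favors fewer, coarser clusters.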
