An Online Hierarchical Algorithm for Extreme Clustering
Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, Andrew McCallum

TL;DR
This paper presents PERCH, an online hierarchical clustering algorithm that scales to both very large datasets and very large numbers of clusters, and that provably achieves perfect dendrogram purity under a separability assumption.
Contribution
The paper introduces PERCH, a novel non-greedy online hierarchical clustering algorithm that scales to both large data sizes and many clusters, with theoretical guarantees and superior empirical performance.
Findings
PERCH constructs more accurate trees than other tree-building clustering algorithms.
PERCH scales efficiently with both data size and number of clusters.
Under a natural separability assumption, the algorithm produces trees with perfect dendrogram purity regardless of the order in which data arrive.
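Dendrogram purity, the quality measure named in the findings above, can be made concrete with a small sketch. The version below assumes the standard definition: for every pair of points sharing a ground-truth label, take the leaves under their least common ancestor and measure the fraction that carry the same label, then average over all such pairs. The nested-tuple tree encoding and the names `leaves`, `lca_leaves`, `dendrogram_purity`, and `label_of` are illustrative choices, not taken from the paper, and the brute-force loop is only suitable for tiny examples.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf ids under a nested-tuple tree (a non-tuple is a leaf)."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def lca_leaves(tree, a, b):
    """Return the leaves under the least common ancestor of distinct leaves a and b."""
    if isinstance(tree, tuple):
        for child in tree:
            under = leaves(child)
            if a in under and b in under:
                # Both points live in this child, so the LCA is deeper.
                return lca_leaves(child, a, b)
    return leaves(tree)


def dendrogram_purity(tree, label_of):
    """Average, over same-label pairs, of the label purity of the pair's LCA subtree."""
    total, count = 0.0, 0
    for a, b in combinations(leaves(tree), 2):
        if label_of[a] != label_of[b]:
            continue
        under = lca_leaves(tree, a, b)
        same = sum(1 for x in under if label_of[x] == label_of[a])
        total += same / len(under)
        count += 1
    # Assumed convention: a tree with no same-label pairs counts as pure.
    return total / count if count else 1.0


# Hypothetical toy data: two ground-truth clusters with two points each.
label_of = {"a1": 0, "a2": 0, "b1": 1, "b2": 1}
pure_tree = (("a1", "a2"), ("b1", "b2"))   # each cluster is its own subtree
mixed_tree = (("a1", "b1"), ("a2", "b2"))  # clusters only meet at the root

print(dendrogram_purity(pure_tree, label_of))   # 1.0
print(dendrogram_purity(mixed_tree, label_of))  # 0.5
```

In the mixed tree, each same-label pair meets only at the root, where half of the four leaves share its label, which gives 0.5.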
Abstract
Many modern clustering methods scale well to a large number of data items, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K--a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher…
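The idea of routing each new point to a leaf of an incrementally built tree can be illustrated with a deliberately simplified sketch. This is not PERCH itself: it assumes the new point is attached as a sibling of its nearest existing leaf, finds that leaf by exhaustive search rather than by efficient routing, and omits the tree rotations the abstract describes. The `Node` class and the `insert` and `iter_leaves` helpers are hypothetical names introduced for illustration.

```python
import math


class Node:
    """Tree node; leaves hold a point, internal nodes hold exactly two children."""

    def __init__(self, point=None):
        self.point = point
        self.children = []
        self.parent = None

    def is_leaf(self):
        return not self.children


def iter_leaves(node):
    """Yield every leaf below (or equal to) the given node."""
    if node.is_leaf():
        yield node
    else:
        for child in node.children:
            yield from iter_leaves(child)


def insert(root, point):
    """Attach a new point as the sibling of its nearest leaf; return the (new) root."""
    new_leaf = Node(point)
    if root is None:
        return new_leaf
    # Exhaustive nearest-leaf search (an assumption made for brevity).
    nearest = min(iter_leaves(root), key=lambda leaf: math.dist(leaf.point, point))
    parent = nearest.parent
    internal = Node()
    internal.children = [nearest, new_leaf]
    internal.parent = parent
    nearest.parent = internal
    new_leaf.parent = internal
    if parent is None:
        return internal  # the nearest leaf was the root
    parent.children[parent.children.index(nearest)] = internal
    return root


# Hypothetical stream: two well-separated groups arriving interleaved.
root = None
for p in [(0, 0), (10, 10), (0, 1), (10, 11)]:
    root = insert(root, p)

for child in root.children:
    print(sorted(leaf.point for leaf in iter_leaves(child)))
```

On this toy stream the two groups end up in separate subtrees of the root. A purely greedy insertion like this one can produce impure or unbalanced trees on harder arrival orders, which is the problem the rotations in the abstract are meant to address.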
Taxonomy
Topics: Advanced Clustering Algorithms Research · Face and Expression Recognition · Advanced Image and Video Retrieval Techniques
