TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs
Laxman Dhulipala, Jason Lee, Jakub Łącki, Vahab Mirrokni

TL;DR
TeraHAC is a scalable, approximate hierarchical clustering algorithm designed for trillion-edge graphs, significantly reducing computation rounds and time while maintaining high clustering quality.
Contribution
TeraHAC combines the nearest-neighbor chain algorithm with (1+ε)-approximate HAC. This lets the graph be partitioned across multiple machines, with each partition making substantial clustering progress before it has to communicate with the others.
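For readers unfamiliar with the nearest-neighbor chain idea, here is a minimal sketch of the classical sequential version with average linkage on a small dense similarity matrix. It is not TeraHAC's distributed algorithm; the function name `nn_chain` and the toy matrix are assumptions made for illustration only.

```python
def nn_chain(sim_matrix):
    """Classical nearest-neighbor chain HAC (average linkage, similarities).

    Illustrative sketch only: dense input, single machine.
    Returns a list of merges (cluster_a, cluster_b, similarity).
    """
    n = len(sim_matrix)
    size = {i: 1 for i in range(n)}
    sim = {i: {j: sim_matrix[i][j] for j in range(n) if j != i} for i in range(n)}
    chain, merges, next_id = [], [], n
    while len(sim) > 1:
        if not chain:
            chain.append(min(sim))  # start a new chain from any active cluster
        a = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        # Most similar neighbor of a; ties go to prev so the chain terminates.
        b = max(sim[a], key=lambda k: (sim[a][k], k == prev))
        if b != prev:
            chain.append(b)  # keep following nearest neighbors
            continue
        # a and b are mutual nearest neighbors: merge them.
        chain.pop()
        chain.pop()
        c, next_id = next_id, next_id + 1
        row_a, row_b = sim.pop(a), sim.pop(b)
        new_row = {}
        for k in sim:
            # Average-linkage update, weighted by cluster sizes.
            new_row[k] = (size[a] * row_a[k] + size[b] * row_b[k]) / (size[a] + size[b])
            del sim[k][a], sim[k][b]
            sim[k][c] = new_row[k]
        sim[c] = new_row
        size[c] = size[a] + size[b]
        merges.append((min(a, b), max(a, b), row_a[b]))
    return merges


# Toy input: {0, 1} and {2, 3} are tight pairs, weakly linked to each other.
S = [
    [0.0, 0.8, 0.1, 0.1],
    [0.8, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.9],
    [0.1, 0.1, 0.9, 0.0],
]
merge_pairs = [(a, b) for a, b, _ in nn_chain(S)]
```

In this toy run the pair (0, 1) is merged before (2, 3), even though (2, 3) has the globally highest similarity. Merging mutual nearest neighbors without waiting for the global maximum is the property that makes the chain approach attractive when the graph is split across partitions.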
Findings
Requires over 100x fewer rounds than previously known approaches for computing HAC
Up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering
Achieves 1.16x higher clustering quality
Abstract
We introduce TeraHAC, a (1+ε)-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing (1+ε)-approximate HAC, which is a novel combination of the nearest-neighbor chain algorithm and the notion of (1+ε)-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds compared to previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher…
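The notion of (1+ε)-approximate HAC can be illustrated with a small sequential sketch. It assumes the usual definition from prior work on graph HAC: a merge is admissible when its similarity is at least the current maximum similarity divided by (1+ε). The function name `approx_hac`, the average-linkage weighting, and the toy graph are illustrative assumptions, not the paper's implementation.

```python
def approx_hac(n, edges, eps):
    """Sequential (1+eps)-approximate average-linkage HAC on a weighted graph.

    Illustrative sketch only. edges: list of (u, v, similarity).
    Returns merges as (cluster_a, cluster_b, similarity_at_merge).
    """
    size = {i: 1 for i in range(n)}
    adj = {i: {} for i in range(n)}  # adj[a][b] = total edge weight between a and b
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    merges, next_id = [], n
    while True:
        # Average-linkage similarity of every pair of adjacent clusters.
        pairs = [(adj[a][b] / (size[a] * size[b]), a, b)
                 for a in adj for b in adj[a] if a < b]
        if not pairs:
            break
        w_max = max(p[0] for p in pairs)
        # Any pair within a (1+eps) factor of the best is admissible;
        # take the first in id order to show it need not be the best one.
        s, a, b = next(p for p in sorted(pairs, key=lambda p: p[1:])
                       if p[0] >= w_max / (1 + eps))
        c, next_id = next_id, next_id + 1
        size[c] = size[a] + size[b]
        merged = {}
        for old in (a, b):
            for nbr, w in adj.pop(old).items():
                if nbr in (a, b):
                    continue
                merged[nbr] = merged.get(nbr, 0.0) + w
                del adj[nbr][old]
        for nbr, w in merged.items():
            adj[nbr][c] = w
        adj[c] = merged
        merges.append((a, b, s))
    return merges


edges = [(0, 1, 0.95), (1, 2, 0.5), (2, 3, 1.0)]
exact = approx_hac(4, edges, eps=0.0)   # -> [(2, 3, 1.0), (0, 1, 0.95), (4, 5, 0.125)]
approx = approx_hac(4, edges, eps=0.1)  # -> [(0, 1, 0.95), (2, 3, 1.0), (4, 5, 0.125)]
```

With ε = 0 the sketch reduces to exact HAC and must merge the heaviest edge first. With ε = 0.1 the edge of similarity 0.95 is already admissible, so it can be merged without first locating the global maximum.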
Taxonomy
Topics: Advanced Clustering Algorithms Research · Complex Network Analysis Techniques · Caching and Content Delivery
