TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

Laxman Dhulipala; Jason Lee; Jakub Łącki; Vahab Mirrokni

arXiv:2308.03578 · cs.DS · June 12, 2024 · 1 citation

Open Access

TL;DR

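TeraHAC is a $(1 + \epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm that scales to trillion-edge graphs. It needs over 100x fewer rounds than previously known HAC approaches and runs up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering.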

Contribution

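The paper combines the nearest-neighbor chain algorithm with the notion of $(1 + \epsilon)$-approximate HAC. This lets the graph be partitioned among multiple machines, with each partition making significant clustering progress before it needs to communicate with the others.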
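To make the nearest-neighbor chain idea concrete, here is a minimal sequential sketch in Python. It is not the TeraHAC implementation: it assumes average linkage on a weighted similarity graph (missing edges count as similarity 0), and the function name `nn_chain_hac` and the edge-list input format are hypothetical choices made for illustration. The chain follows each cluster to its most similar neighbor and merges as soon as two clusters are each other's best neighbor.

```python
def nn_chain_hac(edges, n):
    """Average-linkage HAC via nearest-neighbor chains (illustrative sketch).

    edges: iterable of (u, v, similarity) with 0 <= u, v < n.
    Returns merges as (a, b, similarity); merged clusters get ids n, n+1, ...
    """
    # adj[a][b] holds the total edge weight between clusters a and b.
    adj = {i: {} for i in range(n)}
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    size = {i: 1 for i in range(n)}
    active = set(range(n))
    merges = []
    next_id = n

    def sim(a, b):
        # Average linkage: total weight divided by the number of vertex pairs.
        return adj[a][b] / (size[a] * size[b])

    chain = []
    while active:
        if not chain:
            chain.append(min(active))  # deterministic starting cluster
        top = chain[-1]
        if not adj[top]:
            # No neighbors left: this cluster can never merge again.
            active.discard(top)
            chain.pop()
            continue
        prev = chain[-2] if len(chain) > 1 else None
        # Most similar neighbor; ties prefer the previous chain element so
        # that similarities strictly increase along the chain.
        best = max(adj[top], key=lambda c: (sim(top, c), c == prev, -c))
        if best != prev:
            chain.append(best)
            continue
        # top and prev are reciprocal nearest neighbors: merge them.
        chain.pop()
        chain.pop()
        a, b = sorted((top, prev))
        merges.append((a, b, sim(a, b)))
        merged = {}
        for x in (a, b):
            for c, w in adj[x].items():
                if c != a and c != b:
                    merged[c] = merged.get(c, 0.0) + w
                    del adj[c][x]
            del adj[x]
            active.discard(x)
        adj[next_id] = merged
        for c, w in merged.items():
            adj[c][next_id] = w
        size[next_id] = size[a] + size[b]
        active.add(next_id)
        next_id += 1
    return merges
```

On the path graph with edges `(0, 1, 0.9)`, `(1, 2, 0.5)`, `(2, 3, 0.8)`, the sketch merges 0 with 1, then 2 with 3, then the two resulting clusters. In general the chain does not produce merges in globally descending similarity order, but for reducible linkages such as average linkage it yields the same hierarchy as the greedy algorithm. The merges are found by following local neighbor pointers only.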

Findings

01 Requires over 100x fewer rounds than previously known approaches for computing HAC

02 Up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering

03 Achieves 1.16x higher clustering quality

Abstract

We introduce TeraHAC, a $(1 + \epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing $(1 + \epsilon)$-approximate HAC, which is a novel combination of the nearest-neighbor chain algorithm and the notion of $(1 + \epsilon)$-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds compared to previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher…
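The abstract does not spell out what $(1 + \epsilon)$-approximate HAC means, so the following Python sketch rests on an assumption: that a merge is allowed whenever its similarity is at least $W_{\max} / (1 + \epsilon)$, where $W_{\max}$ is the largest similarity between any two current clusters. The function name `approx_hac`, the average-linkage similarity, and the edge-list input are hypothetical choices for illustration. This is a naive sequential loop, not the distributed algorithm the paper describes.

```python
def approx_hac(edges, n, eps):
    """Sequential (1 + eps)-approximate average-linkage HAC (illustrative sketch).

    Assumption: a merge is acceptable if its similarity is at least
    w_max / (1 + eps), where w_max is the current maximum similarity.
    Returns merges as (a, b, similarity); merged clusters get ids n, n+1, ...
    """
    adj = {i: {} for i in range(n)}
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    size = {i: 1 for i in range(n)}
    merges = []
    next_id = n
    while True:
        # Every pair of current clusters joined by an edge, in a fixed order.
        pairs = [(a, b, adj[a][b] / (size[a] * size[b]))
                 for a in sorted(adj) for b in sorted(adj[a]) if a < b]
        if not pairs:
            break
        w_max = max(s for _, _, s in pairs)
        # Take the first pair that is good enough, not necessarily the best.
        a, b, s = next(p for p in pairs if p[2] >= w_max / (1 + eps))
        merges.append((a, b, s))
        merged = {}
        for x in (a, b):
            for c, w in adj[x].items():
                if c != a and c != b:
                    merged[c] = merged.get(c, 0.0) + w
                    del adj[c][x]
            del adj[x]
        adj[next_id] = merged
        for c, w in merged.items():
            adj[c][next_id] = w
        size[next_id] = size[a] + size[b]
        next_id += 1
    return merges
```

With edges `(0, 1, 0.8)`, `(1, 2, 0.5)`, `(2, 3, 0.9)`, setting `eps=0.0` forces the exact order (2 with 3 first), while `eps=0.2` lets the 0.8 edge merge first because 0.8 is at least 0.9 / 1.2. Under this assumed definition, a cluster pair can be merged without first confirming that it is the global maximum.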

Taxonomy

Topics: Advanced Clustering Algorithms Research · Complex Network Analysis Techniques · Caching and Content Delivery