Scalable Exact Hierarchical Agglomerative Clustering via Sparse Geographic Distance Graphs
Victor Maus, Vinicius Pozzobon Borin

TL;DR
GSHAC enables exact hierarchical clustering of millions of geographic points efficiently by using a sparse distance graph, significantly reducing memory and computation compared to traditional methods.
Contribution
The paper introduces GSHAC, a scalable system for exact HAC on large spatial datasets using sparse geographic graphs and spatial indexing.
Findings
GSHAC completes clustering of 261,073 points in 12 seconds with 109 MiB memory.
On 2 million points, GSHAC finishes in under 3 minutes with peak memory under 3 GiB.
Connected components in the sparse graph allow exact clustering for standard linkage methods.
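As a rough illustration of the connected-components finding, the following minimal sketch builds a distance-bounded sparse graph and splits it into components. It is not the authors' implementation: the function names, the spherical haversine distance, the 3D grid used as a stand-in for a spatial index, and the sample coordinates are all assumptions made for this example.

```python
import math
from collections import defaultdict
from itertools import product

EARTH_RADIUS_M = 6_371_008.8  # assumption: spherical Earth with mean radius


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def to_xyz(lat, lon):
    """Unit-sphere Cartesian coordinates of a (lat, lon) pair in degrees."""
    lat, lon = math.radians(lat), math.radians(lon)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))


def sparse_edges(points, bound_m):
    """Return {(i, j): distance} for every pair i < j within bound_m metres."""
    # Two points within bound_m on the sphere are within this chord length in 3D,
    # so hashing into cubes of that side means only the 27 adjacent cubes need checking.
    chord = 2 * math.sin(bound_m / (2 * EARTH_RADIUS_M))
    keys = [tuple(math.floor(c / chord) for c in to_xyz(lat, lon)) for lat, lon in points]
    cells = defaultdict(list)
    for i, key in enumerate(keys):
        cells[key].append(i)
    edges = {}
    for i, (kx, ky, kz) in enumerate(keys):
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((kx + dx, ky + dy, kz + dz), ()):
                if j > i:
                    d = haversine_m(points[i], points[j])
                    if d <= bound_m:
                        edges[(i, j)] = d
    return edges


def connected_components(n, edges):
    """Group point indices into connected components with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Hypothetical sample: two nearby pairs on different continents plus one isolated point.
points = [(48.20, 16.37), (48.21, 16.38), (-23.55, -46.63), (-23.56, -46.64), (0.0, 0.0)]
edges = sparse_edges(points, bound_m=5_000.0)
components = connected_components(len(points), edges)
print(components)
```

Each component can then be clustered on its own, since no pair of points in different components lies within the bound.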
Abstract
Exact hierarchical agglomerative clustering (HAC) of large spatial datasets is limited in practice by the time and memory required for the full pairwise distance matrix. We present GSHAC (Geographically Sparse Hierarchical Agglomerative Clustering), a system that makes exact HAC feasible at scales of millions of geographic features on a commodity workstation. GSHAC replaces the distance matrix with a sparse geographic distance graph containing only pairs within a user-specified geodesic bound. The graph is constructed via spatial indexing, in time that depends on the mean number of neighbors within the bound. Connected components of this graph define independent subproblems, and we prove that the resulting assignments are exact for all standard linkage methods at any cut height up to that bound. For single linkage, an MST-based path keeps memory…
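The abstract mentions an MST-based path for single linkage. The sketch below shows the general idea only, using Kruskal's algorithm on a sparse edge list: the minimum spanning forest's edges, in ascending order, are the single-linkage merge heights, and cutting at a height no larger than the graph's distance bound yields flat clusters. The function names and the toy edge weights are hypothetical, and this is not claimed to be how GSHAC implements it.

```python
def single_linkage_merges(n, edges):
    """Kruskal's algorithm: return minimum-spanning-forest edges as (height, i, j), ascending."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for (i, j), d in sorted(edges.items(), key=lambda kv: kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:          # edge joins two clusters: a single-linkage merge at height d
            parent[rj] = ri
            merges.append((d, i, j))
    return merges


def cut(n, merges, height):
    """Flat cluster labels obtained by applying every merge with height <= the cut."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for d, i, j in merges:
        if d > height:
            break             # merges are sorted, so all later ones are above the cut
        parent[find(j)] = find(i)
    labels = {}
    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical sparse graph: distances in metres for pairs within some bound (say 500 m).
toy_edges = {(0, 1): 100.0, (1, 2): 250.0, (0, 2): 300.0, (3, 4): 400.0}
merges = single_linkage_merges(5, toy_edges)
labels_low = cut(5, merges, height=200.0)
labels_high = cut(5, merges, height=500.0)
print(merges, labels_low, labels_high)
```

Only the sparse edges and a length-n parent array are held in memory here, which is the appeal of working from a spanning forest rather than a dense distance matrix.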
