Communication-Optimal Distributed Clustering
Jiecao Chen, He Sun, David P. Woodruff, Qin Zhang

TL;DR
This paper develops communication-efficient distributed clustering protocols for graph and geometric data, shows they are nearly optimal by proving almost matching communication lower bounds, and highlights the advantage of a broadcast channel over point-to-point communication.
Contribution
The work introduces nearly optimal distributed clustering protocols in both point-to-point and broadcast models, with theoretical bounds and practical validation.
Findings
A broadcast channel significantly reduces communication cost compared with the point-to-point model.
The protocols' communication costs nearly match the proven lower bounds.
Algorithms perform well on real datasets.
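To give a rough feel for why a broadcast channel can help, here is a minimal accounting sketch. It is not the paper's protocol. It assumes a toy setting in which every site shares a fixed-size summary with all other sites, and it assumes a broadcast message is charged once no matter how many sites read it. The function names and the parameters `num_sites` and `summary_bits` are hypothetical and chosen only for illustration.

```python
def point_to_point_bits(num_sites: int, summary_bits: int) -> int:
    """Total bits when each site sends its summary to every other site
    over separate point-to-point links (toy all-to-all exchange)."""
    # Each of the num_sites sites sends (num_sites - 1) copies.
    return num_sites * (num_sites - 1) * summary_bits


def broadcast_bits(num_sites: int, summary_bits: int) -> int:
    """Total bits when each site posts its summary once on a shared
    broadcast channel (assumption: a broadcast is charged once)."""
    # Each site sends exactly one copy that all other sites can read.
    return num_sites * summary_bits


if __name__ == "__main__":
    s, b = 10, 1024  # illustrative values, not taken from the paper
    print("point-to-point:", point_to_point_bits(s, b))
    print("broadcast:     ", broadcast_bits(s, b))
```

In this toy accounting the gap grows linearly with the number of sites. The actual bounds in the paper depend on the clustering problem and the protocol, and are not captured by this sketch.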
Abstract
Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to spectrally cluster points or vertices in a graph distributed across servers, for a worst-case…
Taxonomy
Topics: Advanced Clustering Algorithms Research · Privacy-Preserving Technologies in Data · Complexity and Algorithms in Graphs
