Distributed $k$-Clustering for Data with Heavy Noise
Xiangyu Guo, Shi Li

TL;DR
This paper introduces a distributed clustering algorithm that effectively handles data with many outliers, reducing the number of discarded outliers while maintaining low communication costs and high solution quality.
Contribution
It improves outlier handling in distributed $k$-clustering: the number of discarded outliers is near-optimal, the approximation ratio is constant, and the communication cost does not depend on the number of outliers.
Findings
Discards only $(1+\epsilon)z$ outliers, the best possible, in distributed clustering.
Maintains an $O(1)$-approximation ratio, with communication cost independent of the outlier count $z$.
Outperforms previous algorithms in communication efficiency and solution quality.
Abstract
In this paper, we consider the $k$-center/median/means clustering with outliers problems (or the $(k, z)$-center/median/means problems) in the distributed setting. Most previous distributed algorithms have their communication costs linearly depending on $z$, the number of outliers. Recently Guha et al. overcame this dependence issue by considering bi-criteria approximation algorithms that output solutions with $2z$ outliers. For the case where $z$ is large, the extra $z$ outliers discarded by the algorithms might be too large, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible $(1+\epsilon)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$. The problems we consider include the $(k, z)$-center problem, and $(k, z)$-median/means problems in Euclidean metrics.…
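To make the objective in the abstract concrete, the following is a minimal single-machine sketch of how a candidate solution to the $k$-center problem with outliers can be scored: each point is assigned to its nearest center, the $z$ farthest points are discarded as outliers, and the cost is the largest remaining distance. This is not the paper's distributed algorithm; the function name `kz_center_cost` and the toy data are illustrative assumptions.

```python
import math

def kz_center_cost(points, centers, z):
    """Score `centers` under the k-center-with-outliers objective.

    Hypothetical helper for illustration: discards the z points farthest
    from their nearest center and returns the maximum remaining distance.
    """
    # Distance from every point to its nearest center, in ascending order.
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    # Drop the z largest distances (the outliers); never drop more than we have.
    kept = dists[:max(len(dists) - z, 0)]
    return kept[-1] if kept else 0.0

# Toy data (assumed): two tight groups plus one far-away noise point.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (100.0, 100.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

cost_no_outliers = kz_center_cost(points, centers, 0)  # dominated by the noise point
cost_one_outlier = kz_center_cost(points, centers, 1)  # noise point discarded
```

In this toy instance, allowing a single outlier drops the cost from the distance of the noise point to the radius of the two tight groups. A bi-criteria algorithm of the kind the abstract describes would be scored the same way, but with a discard budget of $(1+\epsilon)z$ instead of $z$.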
Taxonomy
Topics: Complexity and Algorithms in Graphs · Facility Location and Emergency Management · Privacy-Preserving Technologies in Data
