Distributed $k$-Clustering for Data with Heavy Noise

Xiangyu Guo; Shi Li

arXiv:1810.07852 · cs.DC · November 30, 2018 · 5 cites

TL;DR

This paper introduces a distributed clustering algorithm that effectively handles data with many outliers, reducing the number of discarded outliers while maintaining low communication costs and high solution quality.

Contribution

It improves outlier handling in distributed $k$-clustering: the algorithms discard only $(1+\epsilon)z$ outliers, the best possible, while keeping an $O(1)$-approximation ratio and a communication cost that does not depend on the number of outliers $z$.

Findings

01

Discards only $(1+\epsilon)z$ outliers, the best possible, in distributed clustering.

02

Maintains an $O(1)$-approximation ratio, with communication cost independent of the outlier count $z$.

03

Outperforms previous algorithms in communication efficiency and solution quality.

Abstract

In this paper, we consider the $k$-center/median/means clustering with outliers problems (or the $(k, z)$-center/median/means problems) in the distributed setting. Most previous distributed algorithms have their communication costs linearly depending on $z$, the number of outliers. Recently Guha et al. overcame this dependence issue by considering bi-criteria approximation algorithms that output solutions with $2z$ outliers. For the case where $z$ is large, the extra $z$ outliers discarded by the algorithms might be too large, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible $(1 + \epsilon) z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$. The problems we consider include the $(k, z)$-center problem, and $(k, z)$-median/means problems in Euclidean metrics.…
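To make the three objectives named in the abstract concrete, the following is a minimal sketch of how a $(k, z)$ solution can be scored once the centers are fixed. It uses the standard definition of clustering with outliers and is not taken from the paper or from the xyguo/clusterz repository; the function name `kz_cost` and the toy data are hypothetical, chosen only for illustration.

```python
import math

def kz_cost(points, centers, z, objective="median"):
    """Cost of `centers` after discarding the z points farthest from them.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    # Distance from each point to its nearest center, in ascending order.
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    # With the centers fixed, dropping the z largest distances is the best
    # choice of outliers for all three objectives.
    kept = dists[:max(len(dists) - z, 0)]
    if objective == "center":   # largest remaining distance
        return max(kept, default=0.0)
    if objective == "median":   # sum of remaining distances
        return sum(kept)
    if objective == "means":    # sum of remaining squared distances
        return sum(d * d for d in kept)
    raise ValueError(f"unknown objective: {objective}")

# Toy data: two tight groups plus one far-away noise point.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (100, 0)]
centers = [(0, 0), (10, 0)]

print(kz_cost(points, centers, z=0, objective="center"))  # 90.0
print(kz_cost(points, centers, z=1, objective="center"))  # 1.0
print(kz_cost(points, centers, z=1, objective="median"))  # 2.0
```

A bi-criteria algorithm is allowed a larger discard budget than $z$ when its cost is compared against the optimum. With $z = 1000$ and $\epsilon = 0.1$, a budget of $(1 + \epsilon) z$ means 1,100 discarded points, against 2,000 for a budget of $2z$.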

Code & Models

Repositories

xyguo/clusterz (official)

Taxonomy

Topics: Complexity and Algorithms in Graphs · Facility Location and Emergency Management · Privacy-Preserving Technologies in Data