Distributed k-Means with Outliers in General Metrics
Enrico Dandolo, Andrea Pietracaprina, Geppino Pucci

TL;DR
This paper introduces a distributed, coreset-based, 3-round MapReduce approximation algorithm for k-means with outliers in general metric spaces. It uses sublinear local memory per reducer, attains an approximation ratio close to that of the best sequential algorithms, and adapts to the dataset's complexity as captured by its doubling dimension D.
Contribution
It presents the first distributed algorithm for k-means with outliers in general metrics that balances solution quality, memory efficiency, and dataset complexity adaptation.
Findings
Requires sublinear local memory per reducer.
Achieves an approximation ratio close to the best sequential algorithms.
Adapts to dataset complexity via doubling dimension D.
Abstract
Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is undoubtedly the k-means problem, which, given a set P of points from a metric space and a parameter k, requires determining a subset S of k centers minimizing the sum of all squared distances of points in P from their closest center. A more general formulation, known as k-means with outliers, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an…
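To make the objective described in the abstract concrete, here is a minimal Python sketch that evaluates the cost of a given set of centers when the farthest points are disregarded as outliers. The function name `kmeans_outliers_cost`, its parameters, and the toy data are illustrative assumptions, not taken from the paper; the sketch only evaluates the objective and is not the paper's distributed algorithm.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def kmeans_outliers_cost(
    points: Sequence[T],
    centers: Sequence[T],
    num_outliers: int,
    dist: Callable[[T, T], float],
) -> float:
    """Sum of squared distances to the closest center, ignoring the
    `num_outliers` points that are farthest from every center.

    `dist` is any metric on the points (hypothetical helper supplied
    by the caller), so this works in a general metric space.
    """
    if not centers:
        raise ValueError("at least one center is required")
    # Squared distance from each point to its closest center.
    squared = sorted(min(dist(p, c) for c in centers) ** 2 for p in points)
    # Discarding the largest terms minimizes the remaining sum.
    keep = max(len(squared) - num_outliers, 0)
    return sum(squared[:keep])


# Toy example on the real line with the absolute-difference metric.
points = [0, 1, 2, 10, 100]
centers = [1, 10]
line_metric = lambda a, b: abs(a - b)

cost_no_outliers = kmeans_outliers_cost(points, centers, 0, line_metric)   # 8102
cost_one_outlier = kmeans_outliers_cost(points, centers, 1, line_metric)   # 2
```

In this toy instance, the single far point at 100 dominates the plain k-means cost, and allowing one outlier removes it from the sum, which is the motivation the abstract gives for the outlier-tolerant formulation.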
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Advanced Clustering Algorithms Research · Face and Expression Recognition
