Improved Outlier Robust Seeding for k-means
Amit Deshpande, Rameshwar Pratap

TL;DR
This paper introduces a robust variant of the $D^2$ sampling seeding method for $k$-means clustering that effectively handles outliers, providing provable approximation guarantees and improved empirical performance over existing methods.
Contribution
A simple, linear-time algorithm that enhances $D^2$ sampling for robust $k$-means by handling outliers, with theoretical guarantees and practical improvements.
Findings
Outperforms $k$-means++ and other seeding methods on real and synthetic data.
Provides a provable $O(1)$ approximation guarantee in the presence of outliers.
Runs in $O(ndk)$ time and can output exactly $k$ clusters.
Abstract
The $k$-means is a popular clustering objective, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization called $k$-means++ uses $D^2$ sampling and comes with a provable approximation guarantee \cite{AV2007}. However, in the presence of adversarial noise or outliers, $D^2$ sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantees w.r.t. the $k$-means solution on inliers do not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant in the $D^2$ sampling distribution, which makes it robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$ clusters, discards marginally more points than the optimal number of outliers, and comes with a provable $O(1)$ approximation…
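To make the failure mode described in the abstract concrete, here is a minimal sketch of standard $D^2$ sampling, i.e. the $k$-means++ seeding of \cite{AV2007}. It is not the robust variant proposed in the paper, whose modified distribution is not spelled out on this page. The function names `sq_dist`, `d2_weights`, and `d2_seeding` and the toy data are assumptions made for illustration only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_weights(points, centers):
    """Unnormalized D^2 weights: squared distance to the nearest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def d2_seeding(points, k, rng):
    """Standard D^2 sampling (k-means++ seeding), NOT the paper's variant.

    Assumes `points` contains at least k distinct points; otherwise all
    weights can become zero and `rng.choices` raises ValueError.
    """
    first = rng.choice(points)  # first center: uniform at random
    centers = [first]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, first) for p in points]
    while len(centers) < k:
        # Sample the next center with probability proportional to d2.
        c = rng.choices(points, weights=d2, k=1)[0]
        centers.append(c)
        # Update nearest-center distances in O(nd) per round.
        d2 = [min(old, sq_dist(p, c)) for old, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 200 inliers near the origin plus one far-away outlier.
    inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    points = inliers + [(100.0, 100.0)]
    # With a single center at the origin, the lone outlier carries
    # almost all of the D^2 sampling mass.
    w = d2_weights(points, [(0.0, 0.0)])
    print("outlier share of sampling mass:", w[-1] / sum(w))
    print("seeds:", d2_seeding(points, 3, rng))
```

Each round costs $O(nd)$ because the nearest-center distances are updated incrementally, so $k$ rounds cost $O(ndk)$, the same order as the running time quoted above. In the toy run a single distant point dominates the sampling distribution, which is the behaviour the paper's modified distribution is designed to avoid.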
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Sparse and Compressive Sensing Techniques · Imbalanced Data Classification Techniques
