Improved Outlier Robust Seeding for k-means

Amit Deshpande; Rameshwar Pratap

arXiv:2309.02710 · cs.LG · September 7, 2023

TL;DR

This paper introduces a robust variant of the $D^2$ sampling seeding method for $k$-means clustering that effectively handles outliers, providing provable approximation guarantees and improved empirical performance over existing methods.

Contribution

A simple, linear-time algorithm that enhances $D^2$ sampling for robust $k$-means by handling outliers, with theoretical guarantees and practical improvements.

Findings

01

Outperforms $k$-means++ and other seeding methods on real and synthetic data.

02

Provides a provable $O(1)$ approximation guarantee in the presence of outliers.

03

Runs in $O(ndk)$ time and can output exactly $k$ clusters.

Abstract

The $k$-means is a popular clustering objective, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization, called $k$-means++, uses $D^{2}$ sampling and comes with a provable $O(\log k)$ approximation guarantee [AV2007]. However, in the presence of adversarial noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantees with respect to the $k$-means solution on inliers do not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant of the $D^{2}$ sampling distribution, which makes it robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$ clusters, discards marginally more points than the optimal number of outliers, and comes with a provable $O(1)$ approximation…
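
To make the outlier sensitivity concrete, the following is a minimal Python sketch of standard $D^{2}$ sampling as used in $k$-means++ seeding. It is not the robust variant proposed in the paper, whose exact distribution is not given in the text above. The helper names (`sq_dist`, `d2_probabilities`, `d2_seeding`) and the toy data are assumptions made purely for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_probabilities(points, centers):
    """Standard D^2 distribution: P(x) is proportional to the squared
    distance from x to its nearest already-chosen center."""
    costs = [min(sq_dist(p, c) for c in centers) for p in points]
    total = sum(costs)
    return [c / total for c in costs]


def d2_seeding(points, k, rng):
    """Standard k-means++ seeding (no outlier handling).

    Assumes the data has at least k distinct points, so the sampling
    weights are never all zero.
    """
    centers = [rng.choice(points)]  # first center: uniform at random
    # Cache each point's squared distance to its nearest chosen center,
    # so every round costs O(nd) and the whole seeding costs O(ndk).
    closest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = rng.choices(range(len(points)), weights=closest, k=1)[0]
        centers.append(points[idx])
        closest = [min(old, sq_dist(p, points[idx]))
                   for old, p in zip(closest, points)]
    return centers


# Toy data (hypothetical): two tight inlier clusters plus one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (1000, 1000)]

# Sampling distribution for the second center, given an inlier first center.
probs = d2_probabilities(points, centers=[points[0]])
print(f"P(outlier picked next) = {probs[-1]:.4f}")

centers = d2_seeding(points, k=3, rng=random.Random(0))
print(centers)
```

On this toy data, once the first center is an inlier, almost all of the sampling mass falls on the single distant point, which is the failure mode the abstract describes. Caching each point's distance to its nearest chosen center is what keeps the plain seeding at $O(ndk)$ time.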

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Sparse and Compressive Sensing Techniques · Imbalanced Data Classification Techniques