Distributionally Robust K-Means Clustering
Vikrant Malik, Taylan Kargin, Babak Hassibi

TL;DR
This paper introduces a distributionally robust version of k-means clustering that enhances robustness to outliers, noise, and distribution shifts by considering an ambiguity set around the empirical distribution.
Contribution
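The paper casts k-means as a minimax problem over a Wasserstein-2 ball around the empirical distribution. A tractable dual of this problem yields a soft-clustering scheme, which is solved by an efficient block coordinate descent algorithm with convergence guarantees.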
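To make the idea of soft clustering concrete, here is a minimal sketch of a generic soft k-means loop in Python. It is not the paper's algorithm: the softmin weighting, the inverse-temperature parameter `beta`, and the function name `soft_kmeans` are all assumptions chosen for illustration, whereas the paper derives its weights from the dual of the Wasserstein-2 minimax problem. The sketch only shows the alternating structure (update weights, then update centers) that such schemes share.

```python
import numpy as np

def soft_kmeans(X, k, beta=5.0, n_iter=50, seed=0):
    """Generic soft k-means (illustrative only, not the paper's method).

    X: (n, d) data array; k: number of clusters;
    beta: hypothetical inverse temperature controlling how soft assignments are.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distances from every point to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Softmin weights; subtracting the row minimum keeps exp() stable.
        w = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
        w /= w.sum(axis=1, keepdims=True)
        # Center update: weighted mean of all points under the soft weights.
        mass = np.maximum(w.sum(axis=0), 1e-12)
        centers = (w.T @ X) / mass[:, None]
    return centers, w
```

As `beta` grows, the weights approach hard nearest-center assignments and the loop reduces to Lloyd's algorithm; smaller values spread each point's weight across several centers.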
Findings
Improved outlier detection in experiments.
Enhanced robustness to noise and distribution shifts.
Efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence.
Abstract
K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd-Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate…
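The minimax formulation described in the abstract can be written out as follows. The notation is assumed here rather than taken from the paper: $\hat{P}_n$ denotes the empirical distribution of the $n$ samples, $\rho \ge 0$ the radius of the ambiguity set, $W_2$ the Wasserstein-2 distance, and $c_1, \dots, c_k$ the cluster centers.

```latex
\min_{c_1, \dots, c_k} \;
\sup_{Q \,:\, W_2(Q, \hat{P}_n) \le \rho} \;
\mathbb{E}_{x \sim Q}\!\left[ \min_{1 \le j \le k} \lVert x - c_j \rVert_2^2 \right]
```

With $\rho = 0$ the ambiguity set contains only $\hat{P}_n$, and the objective reduces to the standard k-means cost $\frac{1}{n} \sum_{i=1}^{n} \min_{j} \lVert x_i - c_j \rVert_2^2$.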
