An Observation on Lloyd's k-Means Algorithm in High Dimensions
David Silva-Sánchez, Roy R. Lederman

TL;DR
This paper gives a theoretical analysis of Lloyd's k-means algorithm in high-dimensional, high-noise settings with limited sample sizes, and identifies regimes where, with high probability, almost every partition of the data is a fixed point of the algorithm.
Contribution
Using a simple Gaussian Mixture Model (GMM), the paper offers a theoretical explanation for why k-means fails in high dimensions when noise is high and samples are limited.
Findings
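In high dimensions, with high noise and limited sample sizes, k-means fails to identify the true clusters.
In the regimes identified, almost every partition of the data is, with high probability, a fixed point of k-means.
The analysis is motivated by more complex settings, such as masked GMMs, and by applications in Cryo-Electron Microscopy (Cryo-EM).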
Abstract
Clustering and estimating cluster means are core problems in statistics and machine learning, with k-means and Expectation Maximization (EM) being two widely used algorithms. In this work, we provide a theoretical explanation for the failure of k-means in high-dimensional settings with high noise and limited sample sizes, using a simple Gaussian Mixture Model (GMM). We identify regimes where, with high probability, almost every partition of the data becomes a fixed point of the k-means algorithm. This study is motivated by challenges in the analysis of more complex cases, such as masked GMMs, and those arising from applications in Cryo-Electron Microscopy.
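The fixed-point claim can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' code: it assumes a two-component GMM with means at plus and minus mu and isotropic noise, draws balanced random partitions, applies one Lloyd update (recompute centroids, reassign each point to the nearest one), and reports how often the partition is left unchanged. The function names and the parameters n, d, mu_norm, sigma, and trials are illustrative choices, not values taken from the paper.

```python
import numpy as np


def sample_gmm(n, d, mu_norm, sigma, rng):
    """Draw n points from 0.5*N(mu, sigma^2 I) + 0.5*N(-mu, sigma^2 I) in R^d."""
    mu = np.zeros(d)
    mu[0] = mu_norm  # assumption: the mean lies along the first axis
    signs = rng.choice([-1.0, 1.0], size=n)
    x = signs[:, None] * mu + sigma * rng.standard_normal((n, d))
    return x, signs


def lloyd_step(x, labels, k=2):
    """One Lloyd iteration: centroids from the labels, then nearest-centroid labels."""
    centroids = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
    sq_dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


def is_fixed_point(x, labels):
    """True if a Lloyd iteration leaves the partition unchanged."""
    return np.array_equal(lloyd_step(x, labels), labels)


def fixed_point_fraction(n, d, mu_norm, sigma, trials, seed):
    """Fraction of balanced random partitions that are fixed points of Lloyd's step."""
    rng = np.random.default_rng(seed)
    x, _ = sample_gmm(n, d, mu_norm, sigma, rng)
    hits = 0
    for _ in range(trials):
        labels = rng.permutation(np.arange(n) % 2)  # balanced random 2-partition
        hits += is_fixed_point(x, labels)
    return hits / trials


if __name__ == "__main__":
    # High dimension, few samples, weak signal relative to the noise.
    print(fixed_point_fraction(n=40, d=5000, mu_norm=1.0, sigma=1.0, trials=20, seed=0))
    # Low dimension, more samples, well-separated means.
    print(fixed_point_fraction(n=200, d=2, mu_norm=3.0, sigma=1.0, trials=20, seed=0))
```

A heuristic reading of the sketch: each point contributes to its own cluster's centroid, and when d is much larger than n that self-contribution (of order sigma^2 d / n) outweighs the cross terms with the other points, so the point stays closest to the centroid it was assigned to regardless of its true component.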
Taxonomy
Topics: Data Mining Algorithms and Applications
