An Observation on Lloyd's k-Means Algorithm in High Dimensions

David Silva-Sánchez; Roy R. Lederman

arXiv:2506.14952·stat.ML·June 19, 2025

Open Access

TL;DR

This paper gives a theoretical explanation for why Lloyd's k-means algorithm fails on high-dimensional, noisy data with limited sample sizes, and identifies regimes where almost every partition of the data is a fixed point of the algorithm.

Contribution

Using a simple Gaussian Mixture Model (GMM), the paper offers a theoretical explanation for the failure of k-means in high-dimensional settings with high noise and limited sample sizes.

Findings

01. In high dimensions with high noise and limited sample sizes, k-means can fail to identify the true clusters.

02. In certain regimes, almost every partition of the data becomes, with high probability, a fixed point of k-means.

03. The analysis is motivated by more complex cases, such as masked GMMs, and by applications in Cryo-Electron Microscopy.

Abstract

Clustering and estimating cluster means are core problems in statistics and machine learning, with k-means and Expectation Maximization (EM) being two widely used algorithms. In this work, we provide a theoretical explanation for the failure of k-means in high-dimensional settings with high noise and limited sample sizes, using a simple Gaussian Mixture Model (GMM). We identify regimes where, with high probability, almost every partition of the data becomes a fixed point of the k-means algorithm. This study is motivated by challenges in the analysis of more complex cases, such as masked GMMs, and those arising from applications in Cryo-Electron Microscopy.
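
The fixed-point claim in the abstract can be probed numerically. The sketch below is not the authors' experiment; it is a minimal illustration under assumed settings: a two-component GMM with means at plus and minus `mu` along the first coordinate, isotropic noise of scale `sigma`, and a balanced random initial partition. The function names `lloyd_step` and `fixed_point_fraction` and all parameter values are hypothetical choices made for this example. It applies one Lloyd iteration to a random partition and records how often the partition is left unchanged.

```python
import numpy as np


def lloyd_step(X, labels, k):
    """One Lloyd iteration: recompute centroids, then reassign every point."""
    centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, shape (n, k).
    sq_dists = (
        (X ** 2).sum(axis=1, keepdims=True)
        - 2.0 * X @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return sq_dists.argmin(axis=1)


def fixed_point_fraction(n, d, trials=20, mu=1.0, sigma=1.0, seed=0):
    """Fraction of random balanced partitions left unchanged by one Lloyd step.

    Assumed model (illustrative, not taken from the paper): two Gaussian
    components with means +mu and -mu on the first axis, noise scale sigma.
    """
    rng = np.random.default_rng(seed)
    k = 2
    means = np.zeros((k, d))
    means[0, 0], means[1, 0] = mu, -mu
    unchanged = 0
    for _ in range(trials):
        true_component = rng.integers(0, k, size=n)
        X = means[true_component] + sigma * rng.standard_normal((n, d))
        # Balanced random partition, unrelated to the true components;
        # balancing guarantees that no cluster starts empty.
        labels = rng.permutation(np.arange(n) % k)
        if np.array_equal(lloyd_step(X, labels, k), labels):
            unchanged += 1
    return unchanged / trials


if __name__ == "__main__":
    print("d = 5000:", fixed_point_fraction(n=40, d=5000))
    print("d = 2:   ", fixed_point_fraction(n=40, d=2))
```

With few samples and a large dimension, each point contributes a share of its own large squared norm to its assigned centroid, which pulls it toward that centroid regardless of the true component. Under these assumed settings one would therefore expect the fraction to be close to 1 for `d = 5000` and close to 0 for `d = 2`.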

Taxonomy

Topics: Data Mining Algorithms and Applications