The Catastrophic Failure of The k-Means Algorithm in High Dimensions, and How Hartigan's Algorithm Avoids It
Roy R. Lederman, David Silva-Sánchez, Ziling Chen, Gilles Mordant, Amnon Balanov, Tamir Bendory

TL;DR
This paper proves that Lloyd's k-means algorithm fails catastrophically on high-dimensional, high-noise data, where it typically returns its initial partition unchanged. Hartigan's algorithm avoids this failure. The result offers a theoretical explanation for the empirical difficulties of k-means in high dimensions.
Contribution
The paper proves that in high-dimensional, high-noise settings essentially every partition of the data is a fixed point of Lloyd's k-means, and that Hartigan's algorithm does not share this pathology. The contrast shows how much the choice of algorithm matters for high-dimensional clustering.
Findings
In high-dimensional, high-noise settings, Lloyd's k-means returns its initial partition with high probability
Hartigan's k-means algorithm does not exhibit this catastrophic failure
The results give a theoretical explanation for the empirical difficulties of k-means in high dimensions
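A heuristic calculation suggests why the two update rules behave so differently. This is an illustrative sketch, not the paper's proof. It assumes (an assumption made here, not taken from the paper) that the noise is isotropic with per-coordinate variance \sigma^2 in dimension d, and it looks only at the noise contribution for a point x currently in cluster A (size n_A, centroid c_A, which includes x) and another cluster B (size n_B, centroid c_B).

```latex
% Noise-only expected squared distances (assumed isotropic noise, variance \sigma^2 per coordinate)
\begin{aligned}
\mathbb{E}\,\lVert x - c_A \rVert^2 &= \sigma^2 d \left(1 - \tfrac{1}{n_A}\right),
&
\mathbb{E}\,\lVert x - c_B \rVert^2 &= \sigma^2 d \left(1 + \tfrac{1}{n_B}\right).
\end{aligned}

% Lloyd keeps x in A whenever \lVert x - c_A \rVert^2 < \lVert x - c_B \rVert^2.
% The noise alone favors staying, by a gap that grows linearly in d:
\mathbb{E}\,\lVert x - c_B \rVert^2 - \mathbb{E}\,\lVert x - c_A \rVert^2
  = \sigma^2 d \left(\tfrac{1}{n_A} + \tfrac{1}{n_B}\right).

% Hartigan moves x from A to B whenever
\frac{n_B}{n_B + 1}\,\lVert x - c_B \rVert^2 \;<\; \frac{n_A}{n_A - 1}\,\lVert x - c_A \rVert^2 ,
% and the size-dependent weights cancel the noise bias exactly in expectation:
\frac{n_B}{n_B + 1}\,\sigma^2 d \left(1 + \tfrac{1}{n_B}\right)
  \;=\; \sigma^2 d \;=\;
\frac{n_A}{n_A - 1}\,\sigma^2 d \left(1 - \tfrac{1}{n_A}\right).
```

Under this assumption, the self-inclusion of x in its own centroid biases Lloyd's comparison toward the current cluster by an amount of order \sigma^2 d, while random fluctuations are only of order \sigma^2 \sqrt{d}. Hartigan's comparison carries no such bias, so any cluster signal can decide the move.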
Abstract
Lloyd's k-means algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd's algorithm simply returns its initial partition, even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan's k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.
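The contrast described in the abstract can be sketched numerically. The following is a minimal NumPy sketch, not the authors' code. The data model (two Gaussian clusters with means +mu and -mu and unit-variance noise), the sizes n and d, the signal strength, the seed, and the function names `centroids`, `kmeans_cost`, `lloyd`, `hartigan`, and `accuracy` are all assumptions chosen for illustration.

```python
import numpy as np

def centroids(X, labels, k):
    # Mean of each cluster; assumes every cluster is non-empty.
    return np.stack([X[labels == j].mean(axis=0) for j in range(k)])

def kmeans_cost(X, labels, k):
    # Sum of squared distances from each point to its cluster centroid.
    C = centroids(X, labels, k)
    return float(((X - C[labels]) ** 2).sum())

def lloyd(X, labels, k, max_iter=100):
    # Batch update: recompute all centroids, then reassign every point
    # to its nearest centroid; stop at a fixed point.
    labels = labels.copy()
    for _ in range(max_iter):
        C = centroids(X, labels, k)
        d2 = np.stack([((X - c) ** 2).sum(axis=1) for c in C], axis=1)
        new = d2.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        if np.bincount(new, minlength=k).min() == 0:
            break  # guard for this sketch: stop if a cluster would empty
        labels = new
    return labels

def hartigan(X, labels, k, max_sweeps=100):
    # Single-point update: move a point only if the move strictly lowers
    # the k-means cost, using the size-weighted distance comparison.
    labels = labels.copy()
    sizes = np.bincount(labels, minlength=k).astype(float)
    C = centroids(X, labels, k)
    for _ in range(max_sweeps):
        moved = False
        for i in range(len(X)):
            a = labels[i]
            if sizes[a] <= 1:
                continue  # never empty a cluster
            d2 = ((X[i] - C) ** 2).sum(axis=1)
            gain = sizes[a] / (sizes[a] - 1) * d2[a]  # cost removed from a
            loss = sizes / (sizes + 1) * d2           # cost added to target
            loss[a] = np.inf
            b = int(loss.argmin())
            if loss[b] < gain:
                # Update the two affected centroids incrementally.
                C[a] = (sizes[a] * C[a] - X[i]) / (sizes[a] - 1)
                C[b] = (sizes[b] * C[b] + X[i]) / (sizes[b] + 1)
                sizes[a] -= 1
                sizes[b] += 1
                labels[i] = b
                moved = True
        if not moved:
            break
    return labels

def accuracy(labels, truth):
    # Agreement with the planted labels up to label swap (k = 2 only).
    agree = float(np.mean(labels == truth))
    return max(agree, 1.0 - agree)

# Assumed toy setting: few points, very high dimension, weak per-coordinate signal.
rng = np.random.default_rng(0)
n, d, k = 40, 50_000, 2
truth = np.repeat([0, 1], n // 2)
mu = np.full(d, np.sqrt(1500.0 / d))     # squared norm of mu is 1500
signs = 1 - 2 * truth                    # +1 for cluster 0, -1 for cluster 1
X = rng.standard_normal((n, d)) + signs[:, None] * mu

init = rng.permutation(truth)            # random balanced initial partition
lloyd_labels = lloyd(X, init, k)
hartigan_labels = hartigan(X, init, k)

acc_lloyd = accuracy(lloyd_labels, truth)
acc_hartigan = accuracy(hartigan_labels, truth)
print("Lloyd returned its initial partition:", np.array_equal(lloyd_labels, init))
print("accuracy (Lloyd, Hartigan):", acc_lloyd, acc_hartigan)
```

Under these assumptions, Lloyd's iteration should stop immediately at the random initial partition, whereas each of Hartigan's single-point moves strictly lowers the k-means cost. Comparing `acc_lloyd` with `acc_hartigan` shows how far each result ends up from the planted clusters in this particular draw.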
Taxonomy
Topics: Advanced Clustering Algorithms Research · Distributed systems and fault tolerance · Stochastic Gradient Optimization Techniques
