On the Consistency of $k$-means++ algorithm
Mieczysław A. Kłopotek

TL;DR
This paper proves that the expected clustering cost of the k-means++ algorithm on samples converges to the population expected value, enabling reliable subsampling for large datasets.
Contribution
It establishes the convergence of the sample-based k-means++ objective to the population objective, supporting its use in large-scale clustering.
Findings
Expected value of k-means++ objective converges to the population value
Sample-based approximation maintains constant factor guarantees
Supports subsampling for large datasets
Abstract
We prove in this paper that the expected value of the objective function of the $k$-means++ algorithm for samples converges to the population expected value. Since $k$-means++ provides a constant-factor approximation of the $k$-means objective for samples, such an approximation can be achieved for the population as the sample size increases. This result is of potential practical relevance when considering subsampling for clustering large data sets (large databases).
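The subsampling use case mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the standard $D^2$-weighted seeding step of $k$-means++ (without subsequent Lloyd iterations), uses the mean squared distance to the nearest center as the objective, and treats a large synthetic data set as a stand-in for the population. The names `kmeanspp_seed`, `mean_cost`, and `average_population_cost` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def kmeanspp_seed(X, k, rng):
    """Pick k centers from the rows of X by D^2-weighted sampling (k-means++ seeding)."""
    n = X.shape[0]
    first = rng.integers(n)                      # first center: uniform at random
    centers = [X[first]]
    d2 = np.sum((X - X[first]) ** 2, axis=1)     # squared distance to nearest center so far
    for _ in range(1, k):
        total = d2.sum()
        if total == 0.0:
            # Degenerate case: every point already coincides with a center.
            idx = rng.integers(n)
        else:
            idx = rng.choice(n, p=d2 / total)    # probability proportional to D^2
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)


def mean_cost(X, centers):
    """Mean squared distance from each row of X to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()


def average_population_cost(population, k, sample_size, repeats, rng):
    """Seed on random subsamples, then average the resulting cost on the full data."""
    costs = []
    for _ in range(repeats):
        idx = rng.choice(population.shape[0], size=sample_size, replace=False)
        centers = kmeanspp_seed(population[idx], k, rng)
        costs.append(mean_cost(population, centers))
    return float(np.mean(costs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumption: three Gaussian blobs stand in for the "population".
    means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
    population = np.vstack([m + rng.normal(size=(5000, 2)) for m in means])
    for m in (50, 500, 5000):
        avg = average_population_cost(population, k=3, sample_size=m, repeats=20, rng=rng)
        print(f"sample size {m:5d}: average population cost {avg:.3f}")
```

Running the script prints the averaged cost for a few sample sizes, which gives an informal way to watch how the sample-seeded objective behaves as the sample grows. It is an empirical illustration only and does not substitute for the convergence proof.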
Taxonomy
Topics: Advanced Clustering Algorithms Research · Statistical Methods and Inference · Face and Expression Recognition
