On the Consistency of $k$-means++ algorithm

Mieczysław A. Kłopotek

arXiv:1702.06120·cs.LG·February 22, 2017·1 cites

Open Access

TL;DR

This paper proves that the expected clustering cost of the $k$-means++ algorithm on samples converges to the population expected value as the sample size grows, which supports subsampling when clustering large datasets.

Contribution

It establishes the convergence of the sample-based k-means++ objective to the population objective, supporting its use in large-scale clustering.

Findings

01. Expected value of the $k$-means++ objective on samples converges to the population value

02. Sample-based approximation maintains constant-factor guarantees

03. Supports subsampling for large datasets

Abstract

We prove in this paper that the expected value of the objective function of the $k$-means++ algorithm for samples converges to the population expected value. Since $k$-means++ provides a constant-factor approximation of the $k$-means objective for samples, such an approximation can be achieved for the population as the sample size increases. This result is of potential practical relevance when one is considering subsampling to cluster large data sets (large databases).
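The claim in the abstract can be explored numerically. The following is a minimal sketch, not the paper's own procedure. It assumes that a finite synthetic data set stands in for the population, that the objective is normalized per point so that costs are comparable across sample sizes, and that only the seeding stage of $k$-means++ is run, with no subsequent Lloyd iterations. The function names `kmeanspp_seed`, `mean_cost`, and `expected_seed_cost` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """k-means++ seeding (D^2 sampling): return k rows of X as centers."""
    n = X.shape[0]
    # First center: uniformly at random.
    centers = [X[rng.integers(n)]]
    # Squared distance from each point to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Next center: sampled with probability proportional to d2.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)

def mean_cost(X, centers):
    """Average squared distance to the nearest center (per-point cost)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def expected_seed_cost(X, k, rng, repeats=30):
    """Monte Carlo estimate of the expected per-point seeding cost on X."""
    costs = [mean_cost(X, kmeanspp_seed(X, k, rng)) for _ in range(repeats)]
    return float(np.mean(costs))

rng = np.random.default_rng(0)
# Assumed stand-in for the population: three Gaussian blobs in the plane.
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
population = np.vstack([rng.normal(m, 1.0, size=(5000, 2)) for m in means])

# Estimate the expected cost on subsamples of increasing size.
for m in (100, 1000, 10000):
    sample = population[rng.choice(len(population), size=m, replace=False)]
    print(m, expected_seed_cost(sample, 3, rng))
print("full", expected_seed_cost(population, 3, rng))
```

If the convergence result applies, the estimates printed for larger subsamples should settle near the value printed for the full data set. The estimates are themselves noisy because they average a finite number of random seedings, so small differences between runs are expected.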

Taxonomy

Topics: Advanced Clustering Algorithms Research · Statistical Methods and Inference · Face and Expression Recognition