Noisy, Greedy and Not So Greedy k-means++

Anup Bhattacharya; Jan Eube; Heiko Röglin; Melanie Schmidt

arXiv:1912.00653·cs.DS·December 3, 2019·5 cites

TL;DR

This paper analyzes two variations of the k-means++ algorithm. It shows that greedy k-means++ can perform worse than standard k-means++ on certain instances, while a noisy variant still achieves an $O(\log^2 k)$ approximation.

Contribution

It demonstrates that greedy k-means++ can have worse approximation ratios than standard k-means++, and introduces a noisy variant that guarantees an $O(\log^2 k)$ approximation.

Findings

01

Greedy k-means++ can have an $\Omega(\ell \cdot \log k)$ approximation ratio, where $\ell$ is the number of candidate centers sampled in each iteration.

02

Noisy k-means++ achieves an $O(\log^2 k)$ approximation in expectation.

03

The study provides insights into the robustness and limitations of seeding strategies for k-means clustering.

Abstract

The k-means++ algorithm due to Arthur and Vassilvitskii has become the most popular seeding method for Lloyd's algorithm. It samples the first center uniformly at random from the data set and the other $k - 1$ centers iteratively according to $D^{2}$-sampling, where the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of $O(\log k)$ in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called greedy k-means++ algorithm in which in each iteration multiple possible centers are sampled according to $D^{2}$-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an $O(\log k)$-approximation (or even…
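The two seeding procedures described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `dsquared_seeding`, the helper `sq_dist`, the parameter `num_candidates`, and the uniform fallback when every point already coincides with a center are all assumptions made for illustration. With `num_candidates=1` the sketch behaves like standard k-means++; with a larger value it follows the greedy variant by keeping, in each iteration, the sampled candidate that lowers the k-means objective the most.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def dsquared_seeding(points, k, num_candidates=1, rng=None):
    """Hypothetical helper: D^2 seeding with an optional greedy choice.

    num_candidates=1 -> standard k-means++ seeding.
    num_candidates>1 -> greedy k-means++ (best of several D^2 samples).
    Returns the chosen centers and the resulting k-means cost.
    """
    rng = rng or random.Random()
    n = len(points)

    # First center: uniformly at random from the data set.
    centers = [points[rng.randrange(n)]]
    # d2[i] = squared distance from points[i] to its closest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point sits on a center, sample uniformly.
            candidates = [rng.randrange(n) for _ in range(num_candidates)]
        else:
            # D^2-sampling: probability proportional to squared distance.
            candidates = rng.choices(range(n), weights=d2, k=num_candidates)

        # Greedy step: keep the candidate with the lowest resulting cost.
        best_idx, best_d2 = None, None
        for idx in candidates:
            new_d2 = [min(old, sq_dist(p, points[idx]))
                      for old, p in zip(d2, points)]
            if best_d2 is None or sum(new_d2) < sum(best_d2):
                best_idx, best_d2 = idx, new_d2

        centers.append(points[best_idx])
        d2 = best_d2

    return centers, sum(d2)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
    print(dsquared_seeding(data, 3, num_candidates=1, rng=random.Random(0)))
    print(dsquared_seeding(data, 3, num_candidates=4, rng=random.Random(0)))
```

The greedy choice lowers the cost within a single iteration, but that does not imply a better final result: the paper's lower bound shows that greedy k-means++ can be worse than standard k-means++ in expectation on some instances.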

Taxonomy

Topics: Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques