A New Rejection Sampling Approach to $k$-$\mathtt{means}$++ With Improved Trade-Offs

Poojan Shah; Shashwat Agrawal; Ragesh Jaiswal

arXiv:2502.02085·cs.DS·February 5, 2025

TL;DR

This paper introduces a rejection sampling method to accelerate the $k$-means++ seeding algorithm, achieving faster runtimes while maintaining provable approximation guarantees, and explores a new trade-off between computational cost and solution quality.

Contribution

It proposes a rejection sampling approach for $k$-means++ that reduces runtime and offers a novel trade-off between efficiency and clustering quality.
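To make the idea of rejection sampling for seeding concrete, here is a minimal sketch. It is not the paper's algorithm: this page does not describe the paper's proposal distribution or data structures. The sketch assumes a proposal proportional to the squared distance from the first chosen center, which upper-bounds the $D^2$ weight of every point because the first center belongs to the center set. The names `sq_dist`, `d2_sample_by_rejection`, and `max_tries` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample_by_rejection(points, centers, rng=None, max_tries=100_000):
    """Return the index of a point drawn with probability proportional to
    its squared distance to the nearest center (D^2 sampling).

    Assumption (not from the paper): proposals are drawn proportionally to
    the squared distance from centers[0], then accepted with probability
    target / proposal, which is at most 1.
    """
    rng = rng or random.Random()
    c1 = centers[0]
    # Proposal weights depend only on the first center, so a real
    # implementation could compute them once and reuse them every round.
    proposal = [sq_dist(p, c1) for p in points]
    if sum(proposal) == 0:
        raise ValueError("all points coincide with the first center")
    indices = range(len(points))
    for _ in range(max_tries):
        i = rng.choices(indices, weights=proposal, k=1)[0]
        # Only the proposed point needs its distance to every center.
        target = min(sq_dist(points[i], c) for c in centers)
        if rng.random() * proposal[i] < target:
            return i
    raise RuntimeError("no proposal accepted within max_tries")
```

Accepted samples follow the $D^2$ distribution exactly. In this sketch the expected number of proposals per accepted sample equals the ratio of the clustering cost with the first center alone to the cost with the current center set, so the acceptance rate drops as more centers are added.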

Findings

01

The first method runs in $\tilde{O}(\mathrm{nnz}(X) + \beta k^2 d)$ time while remaining $O(\log k)$-competitive in expectation.

02

The second method introduces a scale-invariant factor improving the trade-off between cost and quality.

03

Empirical results validate the theoretical improvements on real datasets.

Abstract

The $k$-$\mathtt{means}$++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem, where the goal is to cluster a dataset $X \subset \mathbb{R}^{d}$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|X| k d)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-$\mathtt{means}$++. Our first method runs in time $\tilde{O}(\mathrm{nnz}(X) + \beta k^{2} d)$ while still being $O(\log k)$ competitive in expectation. Here, $\beta$ is a parameter which is the ratio of the variance of the dataset to the optimal $k$-$\mathtt{means}$ cost in expectation and…
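For reference, the baseline seeding procedure that the abstract calls expensive can be sketched as follows. This is a plain-Python illustration of standard $D^2$ sampling, not code from the paper; the names `sq_dist` and `kmeanspp_seed` are hypothetical. Each round touches every point once, which is where the $O(|X| k d)$ running time comes from.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers from points using D^2 sampling."""
    rng = rng or random.Random()
    n = len(points)
    # First center: uniform over the dataset.
    centers = [points[rng.randrange(n)]]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) > 0:
            # Sample the next center proportionally to d2.
            i = rng.choices(range(n), weights=d2, k=1)[0]
        else:
            # Degenerate case: every point already coincides with a center.
            i = rng.randrange(n)
        centers.append(points[i])
        # O(n * d) update per round, hence O(n * k * d) in total.
        d2 = [min(old, sq_dist(p, points[i])) for old, p in zip(d2, points)]
    return centers
```

A call such as `kmeanspp_seed(points, 3, random.Random(0))` returns three points of the dataset. When the input points are distinct, the returned centers are distinct as well, because an already chosen point has weight zero.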

Taxonomy

Topics: Advanced Statistical Methods and Models · Statistical Methods and Inference · Imbalanced Data Classification Techniques