A New Rejection Sampling Approach to $k$-$\mathtt{means}$++ With Improved Trade-Offs
Poojan Shah, Shashwat Agrawal, Ragesh Jaiswal

TL;DR
This paper introduces a rejection sampling method to accelerate the $k$-means++ seeding algorithm, achieving faster runtimes while maintaining provable approximation guarantees, and explores a new trade-off between computational cost and solution quality.
Contribution
It proposes a rejection sampling approach for $k$-means++ that reduces runtime and offers a novel trade-off between efficiency and clustering quality.
Findings
The first method runs in $\tilde{O}(\text{nnz}(\text{X}) + \beta k^2 d)$ time with $O(\log k)$ guarantees.
The second method introduces a scale-invariant factor improving the trade-off between cost and quality.
Empirical results validate the theoretical improvements on real datasets.
Abstract
The $k$-means++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem, where the goal is to cluster a dataset $X$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and its provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|X| k d)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-means++. Our first method runs in time $\tilde{O}(\text{nnz}(X) + \beta k^2 d)$ while still being $O(\log k)$ competitive in expectation. Here, $\beta$ is a parameter which is the ratio of the variance of the dataset to the optimal $k$-means cost in expectation and…
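To make the abstract's terms concrete, the following is a minimal Python sketch, not the authors' algorithm. `d2_seeding` is the standard $k$-means++ ($D^2$) seeding, which touches every point for every center. `d2_seeding_rejection` illustrates the general idea of rejection sampling for the same distribution, under an assumption made only for this example: the proposal weights are the squared distances to the first center, which upper-bound the true $D^2$ weights. The function names, the `max_tries` parameter, and the choice of proposal are all hypothetical, and the sketch does not attempt to reproduce the paper's running time or its parameter $\beta$.

```python
import numpy as np


def d2_seeding(X, k, rng):
    """Standard k-means++ (D^2) seeding; costs O(n k d) for n points."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    # Squared distance from each point to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Sample the next center with probability proportional to d2.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)


def d2_seeding_rejection(X, k, rng, max_tries=10_000):
    """Illustrative rejection sampler for the same D^2 distribution.

    Assumption (not from the paper): the proposal weight of a point is its
    squared distance to the FIRST center. Because the first center is always
    in the center set, this weight is >= the true D^2 weight, so accepting
    with probability (true weight / proposal weight) yields exact D^2 samples.
    """
    n = X.shape[0]
    first = X[rng.integers(n)]
    centers = [first]
    q = np.sum((X - first) ** 2, axis=1)  # proposal weights, computed once
    q_prob = q / q.sum()
    for _ in range(1, k):
        for _attempt in range(max_tries):
            # Note: rng.choice with p= is itself O(n); a faster sampling
            # structure would be needed for any real speed-up.
            idx = rng.choice(n, p=q_prob)
            # True D^2 weight: distance to the nearest of the current centers,
            # computed only for the proposed point (O(|centers| * d) work).
            target = np.min(np.sum((np.array(centers) - X[idx]) ** 2, axis=1))
            if rng.random() < target / q[idx]:
                centers.append(X[idx])
                break
        else:
            raise RuntimeError("no proposal accepted within max_tries")
    return np.array(centers)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
baseline_centers = d2_seeding(X, 5, rng)
rejection_centers = d2_seeding_rejection(X, 5, rng)
```

In the rejection variant, each proposal is checked against the current centers only, rather than updating distances for the whole dataset. The expected number of proposals per center depends on how loosely the proposal weights bound the true $D^2$ weights, which is the kind of quantity a data-dependent parameter would have to control.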
Taxonomy
Topics: Advanced Statistical Methods and Models · Statistical Methods and Inference · Imbalanced Data Classification Techniques
