Fast and Accurate $k$-means++ via Rejection Sampling
Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

TL;DR
This paper introduces a near-linear time algorithm for $k$-means++ seeding that matches the theoretical guarantees of $k$-means++ while running significantly faster and preserving solution quality, making it more practical for large datasets.
Contribution
The paper proposes a near-linear time algorithm for $k$-means++ seeding that obtains the same theoretical guarantees as $k$-means++ and significantly improves on earlier results for fast $k$-means++ seeding.
Findings
Algorithm is significantly faster than $k$-means++
Maintains the same theoretical guarantees as $k$-means++
Empirically achieves solution quality equivalent to $k$-means++
Abstract
$k$-means++ \cite{arthur2007k} is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees and strong empirical performance. Despite its wide adoption, $k$-means++ sometimes suffers from being slow on large datasets, so a natural question has been to obtain more efficient algorithms with similar guarantees. In this paper, we present a near-linear time algorithm for $k$-means++ seeding. Interestingly, our algorithm obtains the same theoretical guarantees as $k$-means++ and significantly improves earlier results on fast $k$-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than $k$-means++ and obtains solutions of equivalent quality.
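For context, the baseline that the abstract aims to speed up can be sketched as follows. This is a minimal illustration of standard $k$-means++ seeding ($D^2$ sampling), not the rejection-sampling algorithm proposed in the paper; the function names `sq_dist` and `kmeanspp_seed` are hypothetical and chosen only for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (standard k-means++ seeding).

    Illustrative baseline only; this is NOT the paper's faster algorithm.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First center: chosen uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample a point with probability proportional to d2[i].
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Refresh nearest-center distances: a full pass over all points.
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]

    return centers
```

Each new center triggers a full pass over the data, so for $n$ points in $d$ dimensions this baseline costs on the order of $nkd$ operations. That per-center pass is the cost that makes the seeding slow on large datasets when $k$ is large.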
Taxonomy
Topics: Data Management and Algorithms · Advanced Clustering Algorithms Research · Data Stream Mining Techniques
