Fast and Accurate $k$-means++ via Rejection Sampling

Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson

arXiv:2012.11891 · cs.LG · December 23, 2020 · 5 cites

TL;DR

This paper introduces a near-linear time algorithm for $k$-means++ seeding that maintains theoretical guarantees and significantly improves speed while preserving solution quality, making it more practical for large datasets.

Contribution

The paper proposes a near-linear-time algorithm for $k$-means++ seeding that matches the theoretical guarantees of $k$-means++ and significantly improves on earlier results for fast $k$-means++ seeding.

Findings

01. The algorithm is significantly faster than $k$-means++.

02. It maintains the same theoretical guarantees as $k$-means++.

03. Empirically, it achieves solutions of comparable quality.

Abstract

$k$-means++ [Arthur and Vassilvitskii, 2007] is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees, and shows strong empirical performance. Despite its wide adoption, $k$-means++ can be slow on large datasets, so a natural question has been whether more efficient algorithms with similar guarantees can be obtained. In this paper, we present a near-linear-time algorithm for $k$-means++ seeding. Interestingly, our algorithm obtains the same theoretical guarantees as $k$-means++ and significantly improves on earlier results on fast $k$-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than $k$-means++ and obtains solutions of equivalent quality.
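
For context, the seeding step that the abstract refers to can be sketched as follows. This is a minimal pure-Python illustration of the standard $k$-means++ seeding ($D^2$ sampling), not the rejection-sampling algorithm proposed in the paper. The function names `sq_dist` and `kmeanspp_seeding` and the `rng` parameter are assumptions made for this example. Each round rescans all points, so with $n$ points in $d$ dimensions the sketch costs on the order of $nkd$ operations, which is the cost that faster seeding methods aim to reduce.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeding(points, k, rng=None):
    """Standard k-means++ seeding (D^2 sampling); illustrative sketch only.

    points: list of equal-length tuples of numbers
    k:      number of centers to pick, 1 <= k <= len(points)
    rng:    optional random.Random instance (hypothetical parameter,
            added here only to make runs reproducible)
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First center: chosen uniformly at random.
    centers = [points[rng.randrange(len(points))]]

    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample the next center with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])

        # Full pass over the data to refresh nearest-center distances.
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]

    return centers
```

A call such as `kmeanspp_seeding(data, 3, rng=random.Random(0))` returns three of the input points as initial centers, which would then typically be passed to Lloyd's iterations.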

Videos

Fast and Accurate $k$-means++ via Rejection Sampling · SlidesLive

Taxonomy

Topics: Data Management and Algorithms · Advanced Clustering Algorithms Research · Data Stream Mining Techniques