Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Shuo Yang; Haocheng Xi; Yilong Zhao; Muyang Li; Xiaoze Fan; Jintao Zhang; Han Cai; Yujun Lin; Xiuyu Li; Kurt Keutzer; Song Han; Chenfeng Xu; Ion Stoica

arXiv:2603.09229·cs.DC·April 13, 2026

TL;DR

Flash-KMeans introduces a GPU-optimized, IO-aware, and contention-free exact K-Means implementation that significantly accelerates clustering tasks in modern AI systems.

Contribution

It proposes novel kernel-level innovations and co-design strategies to overcome GPU bottlenecks, enabling fast and memory-efficient exact K-Means clustering.

Findings

1. Achieves up to 17.9× speedup over existing baselines.
2. Outperforms industry-standard libraries such as cuML and FAISS by large margins.
3. Demonstrates practical deployability on NVIDIA H200 GPUs.

Abstract

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap,…
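The two bottlenecks named in the abstract can be illustrated with a small CPU-only sketch in plain Python. This is not the paper's GPU kernel design, which the truncated abstract does not describe; it only contrasts the access patterns. All function names here are hypothetical. `assign_materialized` builds the full $N \times K$ distance matrix before taking an argmin, while `assign_streaming` keeps only a running minimum per point. `update_scatter` accumulates each point into its centroid in arrival order (the pattern that would need atomic writes on a GPU), while `update_grouped` first sorts points by label so each cluster becomes one contiguous segment that is reduced once.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_materialized(points, centroids):
    # Stores all N x K distances, then takes a row-wise argmin.
    dist = [[sq_dist(p, c) for c in centroids] for p in points]
    return [min(range(len(centroids)), key=row.__getitem__) for row in dist]


def assign_streaming(points, centroids):
    # Keeps only a running (best distance, best index) per point,
    # so no N x K buffer is ever stored.
    labels = []
    for p in points:
        best_d, best_k = float("inf"), -1
        for k, c in enumerate(centroids):
            d = sq_dist(p, c)
            if d < best_d:
                best_d, best_k = d, k
        labels.append(best_k)
    return labels


def update_scatter(points, labels, k):
    # Scatter-style aggregation: every point writes into its centroid's
    # accumulator, in whatever order the points arrive.
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, lab in zip(points, labels):
        counts[lab] += 1
        for j, x in enumerate(p):
            sums[lab][j] += x
    return sums, counts


def update_grouped(points, labels, k):
    # Sort point indices by label so each cluster is a contiguous segment,
    # then reduce each segment once (one write per centroid).
    dim = len(points[0])
    order = sorted(range(len(points)), key=labels.__getitem__)
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    start = 0
    while start < len(order):
        lab = labels[order[start]]
        end = start
        while end < len(order) and labels[order[end]] == lab:
            end += 1
        segment = [points[i] for i in order[start:end]]
        sums[lab] = [sum(col) for col in zip(*segment)]
        counts[lab] = end - start
        start = end
    return sums, counts
```

Both assignment variants return the same labels, and both update variants return the same per-cluster sums and counts up to floating-point rounding, so the result stays exact; only the memory traffic and write pattern differ. Dividing each row of `sums` by its nonzero entry in `counts` gives the new centroids.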

Code & Models

Repositories

svg-project/flash-kmeans (GitHub)
