TL;DR
Flash-KMeans introduces a GPU-optimized, IO-aware, and contention-free exact K-Means implementation that significantly accelerates clustering tasks in modern AI systems.
Contribution
The paper proposes kernel-level optimizations and algorithm-system co-design strategies to overcome GPU bottlenecks, enabling fast and memory-efficient exact K-Means clustering.
Findings
Achieves up to 17.9× speedup over existing baselines.
Outperforms industry-standard libraries like cuML and FAISS by large margins.
Demonstrates practical deployability on NVIDIA H200 GPUs.
Abstract
k-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable k-means as an online primitive. We point out that existing GPU implementations of k-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap,…
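The two bottlenecks named in the abstract can be illustrated with a small CPU-side NumPy sketch. This is not the paper's kernel code: the abstract is truncated before it describes the proposed remedy, so the function names, the `tile` parameter, and the streaming and sort-based variants below are assumptions chosen only to show the general shape of the problem. `assign_materialized` builds the full N x K distance matrix, while `assign_tiled` keeps only a running minimum per point; `update_scatter` uses scatter-add (which on a GPU would map to atomic adds), while `update_sorted` groups points by label first and reduces contiguous segments.

```python
import numpy as np

def assign_materialized(x, c):
    # Builds the full (N, K) squared-distance matrix before reducing it.
    d = (x * x).sum(1)[:, None] - 2.0 * (x @ c.T) + (c * c).sum(1)[None, :]
    return d.argmin(axis=1)

def assign_tiled(x, c, tile=256):
    # Hypothetical streaming variant: visits centroids tile by tile and keeps
    # only a running (min distance, argmin) pair per point, so at most an
    # (N, tile) block of distances exists at any time.
    n = x.shape[0]
    best = np.full(n, np.inf)
    idx = np.zeros(n, dtype=np.int64)
    x2 = (x * x).sum(1)
    for start in range(0, c.shape[0], tile):
        ct = c[start:start + tile]
        d = x2[:, None] - 2.0 * (x @ ct.T) + (ct * ct).sum(1)[None, :]
        local = d.argmin(axis=1)
        local_min = d[np.arange(n), local]
        better = local_min < best  # strict '<' keeps the earliest index on ties
        best[better] = local_min[better]
        idx[better] = start + local[better]
    return idx

def update_scatter(x, labels, k):
    # Scatter-style aggregation: many points write into the same centroid row.
    sums = np.zeros((k, x.shape[1]))
    np.add.at(sums, labels, x)
    counts = np.bincount(labels, minlength=k)
    return sums, counts

def update_sorted(x, labels, k):
    # Hypothetical contention-free variant: sort points by label, then reduce
    # each contiguous segment once instead of scattering row by row.
    order = np.argsort(labels, kind="stable")
    xs = x[order]
    counts = np.bincount(labels, minlength=k)
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    sums = np.zeros((k, x.shape[1]))
    nonempty = counts > 0  # empty clusters have no segment to reduce
    sums[nonempty] = np.add.reduceat(xs, starts[nonempty], axis=0)
    return sums, counts
```

Both variants of each stage compute the same exact result; the difference is only in how much intermediate state is held and how writes to the centroid accumulators are organized. New centroids follow as `sums / counts` for clusters with nonzero counts.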
