Log-Time K-Means Clustering for 1D Data: Novel Approaches with Proof and Implementation

Jake Hyun

arXiv:2412.15295 · cs.DS · December 25, 2024

TL;DR

This paper presents optimized algorithms for 1D k-means clustering that exploit the structure of sorted 1D data for significant speedups, demonstrated through benchmarks and a practical LLM quantization task, bridging theory and implementation.

Contribution

It introduces novel, efficient algorithms for 1D k-means clustering using sorted data, prefix sums, and binary search, with proven complexity improvements and open-source implementation.
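To make the role of prefix sums concrete, the identity below is a standard expansion, not a formula quoted from the paper. It assumes sorted data $x_1 \le \dots \le x_n$ and introduces the illustrative symbols $P_j$ and $Q_j$ for the prefix sums of $x_i$ and $x_i^2$. With both arrays precomputed, the mean and the squared-error cost of any contiguous cluster covering indices $a+1, \dots, b$ can be evaluated in constant time:

```latex
% Assumed notation (not from the source): P_j = \sum_{i \le j} x_i, \quad Q_j = \sum_{i \le j} x_i^2, \quad P_0 = Q_0 = 0
\mu_{a,b} = \frac{P_b - P_a}{b - a},
\qquad
\sum_{i=a+1}^{b} (x_i - c)^2 = (Q_b - Q_a) - 2c\,(P_b - P_a) + (b - a)\,c^2 .
```

Because clusters in 1D are contiguous ranges of the sorted array, locating each range boundary with a binary search is what would bring the per-cluster work down to logarithmic time.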

Findings

01

Achieves over 4500x speedup over scikit-learn on large datasets.

02

Attains 300x speedup in LLM quantization tasks.

03

Maintains clustering quality with improved computational efficiency.

Abstract

Clustering is a key task in machine learning, with $k$-means being widely used for its simplicity and effectiveness. While 1D clustering is common, existing methods often fail to exploit the structure of 1D data, leading to inefficiencies. This thesis introduces optimized algorithms for $k$-means++ initialization and Lloyd's algorithm, leveraging sorted data, prefix sums, and binary search for improved computational performance. The main contributions are: (1) an optimized $k$-cluster algorithm achieving $O(l \cdot k^{2} \cdot \log n)$ complexity for greedy $k$-means++ initialization and $O(i \cdot k \cdot \log n)$ for Lloyd's algorithm, where $l$ is the number of greedy $k$-means++ local trials, and $i$ is the number of Lloyd's algorithm iterations, and (2) a binary search-based two-cluster algorithm, achieving $O(\log n)$ runtime with deterministic convergence to a Lloyd's algorithm…
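As a rough illustration of how sorted data, prefix sums, and binary search can combine to give an $O(k \cdot \log n)$ Lloyd iteration, here is a minimal sketch using only the Python standard library. It is not the paper's implementation or the flash1dkmeans API; the function names `build_prefix_sums` and `lloyd_step` are hypothetical, and the handling of empty clusters (keeping the old centroid) is an assumption made for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def build_prefix_sums(sorted_xs):
    """Prefix sums of the sorted data, with a leading 0 (one-time O(n) setup)."""
    return [0.0] + list(accumulate(sorted_xs))


def lloyd_step(sorted_xs, prefix, centroids):
    """One Lloyd iteration on sorted 1D data using O(k log n) work.

    Hypothetical helper for illustration only. Returns the updated
    centroids and the cluster boundaries as indices into sorted_xs.
    """
    cs = sorted(centroids)
    n = len(sorted_xs)
    # In 1D, points are assigned to the nearest centroid, so cluster
    # boundaries sit at midpoints between adjacent centroids. Each
    # boundary index is found with one binary search.
    bounds = [0]
    for left, right in zip(cs, cs[1:]):
        midpoint = (left + right) / 2.0
        bounds.append(bisect_right(sorted_xs, midpoint))
    bounds.append(n)

    new_cs = []
    for j, c in enumerate(cs):
        lo, hi = bounds[j], bounds[j + 1]
        if hi > lo:
            # Cluster mean in O(1) from the prefix sums.
            new_cs.append((prefix[hi] - prefix[lo]) / (hi - lo))
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_cs.append(c)
    return new_cs, bounds


# Usage: iterate until the centroids stop changing.
xs = sorted([12.0, 1.0, 10.0, 3.0, 2.0, 11.0])
prefix = build_prefix_sums(xs)
centroids = [0.0, 20.0]
for _ in range(100):
    updated, bounds = lloyd_step(xs, prefix, centroids)
    if updated == centroids:
        break
    centroids = updated
```

Sorting and building the prefix array cost $O(n \log n)$ and $O(n)$ once; after that, each iteration touches only $k$ centroids and performs $k - 1$ binary searches, which is where the logarithmic dependence on $n$ would come from in a scheme like this.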

Code & Models

Repositories

SyphonArch/flash1dkmeans
Official

Taxonomy

Topics: Advanced Clustering Algorithms Research · Face and Expression Recognition · Data Management and Algorithms