Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search
Jack Spalding-Jamieson, Eliot Wong Robson, Da Wei Zheng

TL;DR
This paper introduces a scalable approach to k-means clustering of massive datasets with very large k. It speeds up the point-reassignment step of Lloyd's algorithm using seeded approximate nearest-neighbor search.
Contribution
The paper defines a family of problems called Seeded Approximate Nearest-Neighbor Search and proposes Seeded Search-Graph methods to solve them, enabling faster k-means clustering of massive high-dimensional datasets with large k.
Findings
Significantly reduces runtime of Lloyd's algorithm for large k
Effective for datasets with up to 10^9 points and high dimensions
Outperforms existing methods in speed and scalability
Abstract
For very large values of k, we consider methods for fast k-means clustering of massive datasets with points in high dimensions (). All current practical methods for this problem have runtimes at least . We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we introduce a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
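For orientation, the reassignment step the abstract singles out can be sketched as a plain brute-force Lloyd iteration in NumPy. This is only the textbook baseline, not the paper's Seeded Search-Graph method; the function names `assign_points`, `update_centers`, and `lloyd` are hypothetical and chosen for illustration.

```python
import numpy as np


def assign_points(points, centers):
    """Brute-force reassignment: index of the closest center for each point."""
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, giving an (n, k) matrix.
    d2 = (
        (points ** 2).sum(axis=1)[:, None]
        - 2.0 * points @ centers.T
        + (centers ** 2).sum(axis=1)[None, :]
    )
    return d2.argmin(axis=1)


def update_centers(points, labels, centers):
    """Move each center to the mean of its points; empty clusters keep their old center."""
    k = centers.shape[0]
    sums = np.zeros_like(centers)
    np.add.at(sums, labels, points)           # per-cluster coordinate sums
    counts = np.bincount(labels, minlength=k)
    new_centers = centers.copy()
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers


def lloyd(points, centers, max_iters=20):
    """Alternate center updates and reassignment until the labels stop changing."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = assign_points(points, centers)
    for _ in range(max_iters):
        centers = update_centers(points, labels, centers)
        new_labels = assign_points(points, centers)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return centers, labels
```

In this sketch every call to `assign_points` evaluates all n × k point-to-center distances, which is the cost that dominates when k is very large and that an approximate nearest-neighbor search over the centers is meant to avoid.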
