Fast k-means based on KNN Graph

Cheng-Hao Deng; Wan-Lei Zhao

arXiv:1705.01813 · cs.LG · May 5, 2017 · 1 cites

TL;DR

This paper introduces a scalable k-means clustering method that leverages an approximate k-nearest neighbors graph to significantly reduce computation time, enabling clustering of very large datasets efficiently.

Contribution

The paper presents a k-means algorithm supported by an approximate KNN graph, which is itself constructed iteratively using the fast k-means, with reported speed-ups of hundreds to thousands of times over existing methods.

Findings

01 Achieves speed-ups of hundreds to thousands of times over existing methods.

02 Successfully clusters 10 million 512-dimensional data points in 5.2 hours.

03 Maintains high clustering quality despite the significant speed improvements.

Abstract

In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can be prohibitively high when the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in seeking the closest centroid for each sample in every iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest neighbors graph. In each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the processing cost of this step becomes minor and independent of k. The processing bottleneck is therefore overcome. The most interesting thing is that the k-nearest neighbor graph is constructed by…
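The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `knn_graph_bruteforce` and `graph_restricted_kmeans`, the parameter `kappa` (number of graph neighbors), the random initial labels, and the batch update order are all assumptions made for illustration. An exact brute-force graph stands in for the approximate KNN graph, since the abstract as shown here does not describe how the graph is built.

```python
import numpy as np


def knn_graph_bruteforce(X, kappa):
    """Exact kappa-NN graph by brute force (assumed stand-in for the
    approximate graph; only practical for small n)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    return np.argsort(d2, axis=1)[:, :kappa]


def graph_restricted_kmeans(X, k, knn, n_iter=10, seed=0):
    """k-means where each sample is compared only with the clusters
    that currently hold its graph neighbors (plus its own cluster)."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    labels = rng.integers(0, k, size=n)  # assumed: random initial partition
    centroids = np.zeros((k, dim))
    for _ in range(n_iter):
        # Recompute centroids from the current assignment.
        centroids = np.zeros((k, dim))
        counts = np.bincount(labels, minlength=k)
        np.add.at(centroids, labels, X)
        nonempty = counts > 0
        centroids[nonempty] /= counts[nonempty, None]

        new_labels = labels.copy()
        for i in range(n):
            # Candidate clusters: at most kappa + 1 of them, regardless of k.
            cand = np.unique(np.append(labels[knn[i]], labels[i]))
            d2 = ((centroids[cand] - X[i]) ** 2).sum(axis=1)
            new_labels[i] = cand[np.argmin(d2)]

        if np.array_equal(new_labels, labels):
            break  # no sample changed cluster
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 8))
    knn = knn_graph_bruteforce(X, kappa=10)
    labels, centroids = graph_restricted_kmeans(X, k=20, knn=knn)
    print(np.bincount(labels, minlength=20))
```

In this sketch each sample is compared with at most `kappa + 1` centroids per iteration instead of all `k`, which is the sense in which the cost of the assignment step no longer depends on `k`.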

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Complex Network Analysis Techniques

Methods: k-Means Clustering