An efficient K-means algorithm for Massive Data

Marco Capó; Aritz Pérez; José Antonio Lozano

arXiv:1605.02989 · stat.ML · May 11, 2016 · 1 cites

Open Access

TL;DR

This paper introduces an efficient approximation method for K-means clustering on massive datasets by recursive partitioning and local representation, significantly reducing distance computations while maintaining high clustering quality.

Contribution

It proposes a novel recursive partitioning approach with local data representation to improve K-means efficiency on large-scale data.

Findings

01. Outperforms K-means++ and minibatch K-means in reducing distance computations.

02. Maintains comparable clustering quality with fewer distance calculations.

03. Provides theoretical insights into the method's properties.

Abstract

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm remains one of the most popular clustering methods, in spite of its dependency on the initial settings and its high computational cost, especially in terms of distance computations. In this work, we propose an efficient approximation to the K-means problem intended for massive data. Our approach recursively partitions the entire dataset into a small number of subsets, each of which is characterized by its representative (center of mass) and weight (cardinality). Afterwards, a weighted version of the K-means algorithm is applied over such local representation, which can drastically reduce the number of distances computed. In addition to some…
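The idea in the abstract (summarize each subset by its center of mass and cardinality, then run weighted K-means on those summaries) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the uniform grid in `grid_representatives` is a simple stand-in for the paper's recursive partition, the uniform random initialization is an assumption, and all function and parameter names (`grid_representatives`, `weighted_kmeans`, `bins`, `iters`, `seed`) are hypothetical.

```python
import numpy as np

def grid_representatives(X, bins=4):
    """Summarize X by the center of mass and cardinality of each non-empty grid cell.

    Assumption: a uniform grid stands in for the paper's recursive partition.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on flat dimensions
    # Integer cell index per point and dimension; clip so the maximum falls in the last bin.
    cells = np.minimum((bins * (X - lo) / span).astype(int), bins - 1)
    _, labels, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    labels = labels.ravel()
    sums = np.zeros((counts.size, X.shape[1]))
    np.add.at(sums, labels, X)  # per-cell coordinate sums
    return sums / counts[:, None], counts.astype(float)

def weighted_kmeans(reps, weights, k, iters=50, seed=0):
    """Weighted Lloyd iterations over the representatives (requires k <= len(reps))."""
    rng = np.random.default_rng(seed)
    # Assumption: uniform random initialization among the representatives.
    centers = reps[rng.choice(len(reps), size=k, replace=False)].copy()
    for _ in range(iters):
        # m x k squared distances: m representatives instead of n original points.
        d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            mask = assign == j
            if mask.any():  # keep the old center if a cluster becomes empty
                new_centers[j] = np.average(reps[mask], axis=0, weights=weights[mask])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

# Usage on synthetic data: 15000 points are reduced to at most 8 * 8 = 64 representatives.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(5000, 2)) for c in ((0, 0), (5, 5), (0, 5))])
reps, w = grid_representatives(X, bins=8)
centers, assign = weighted_kmeans(reps, w, k=3)
```

Each iteration here computes one distance per (representative, center) pair rather than per (point, center) pair, which is where the saving in distance computations comes from when the number of subsets is much smaller than the dataset.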

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Data Stream Mining Techniques