How to Use K-means for Big Data Clustering?
Rustam Mussabayev, Nenad Mladenovic, Bassem Jarboui, Ravil Mussabayev

TL;DR
This paper introduces a parallel K-means-based clustering method tailored for big data, demonstrating improved solution quality and runtime without relying on complex hybrid approaches or additional metaheuristics.
Contribution
A novel parallel scheme for K-means and K-means++ that effectively scales to big data and outperforms existing MSSC methods in quality and efficiency.
Findings
The new algorithm outperforms classical and recent state-of-the-art MSSC approaches in solution quality.
The approach significantly reduces runtime for big data clustering.
Data decomposition is effective for solving large-scale clustering problems.
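To make the data-decomposition finding concrete, here is a minimal sketch of one way such a scheme could look. It is not the authors' algorithm, whose details this page does not give. It assumes NumPy, and the names `cluster_by_chunks`, `n_chunks`, and `chunk_size` are hypothetical. Each worker runs K-means++ seeding plus Lloyd iterations on a small random chunk, and the candidate centers with the lowest objective on a shared validation sample are kept, so the full dataset is never clustered at once.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def sse(X, centers):
    """MSSC objective: sum of squared distances to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def kmeanspp_init(X, k, rng):
    """K-means++ seeding: sample each new center with probability proportional to D^2."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        probs = d2 / total if total > 0 else None  # uniform if all points coincide
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)


def kmeans(X, k, rng, n_iter=50):
    """Lloyd iterations started from K-means++ seeds."""
    centers = kmeanspp_init(X, k, rng)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers


def cluster_by_chunks(X, k, n_chunks=8, chunk_size=1000, seed=0):
    """Hypothetical decomposition scheme: solve many small chunks in parallel
    and keep the centers that score best on a shared validation sample."""
    rng = np.random.default_rng(seed)
    size = min(chunk_size, len(X))
    valid = X[rng.choice(len(X), size=size, replace=False)]
    seeds = rng.integers(0, 2**32, size=n_chunks)

    def solve(s):
        local = np.random.default_rng(int(s))
        chunk = X[local.choice(len(X), size=size, replace=False)]
        return kmeans(chunk, k, local)

    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(solve, seeds))
    return min(candidates, key=lambda c: sse(valid, c))
```

Because each chunk starts from a different random sample and seeding, the independent runs explore different local minima, which is one plausible reading of how decomposition can act as a global search without an extra metaheuristic.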
Abstract
K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering that satisfies the properties of a "true big data" algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solve the…
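For reference, the MSSC model named in the abstract is usually stated as follows. The notation here ($X$, $m$, $n$, $k$, $C$) is an assumption chosen for illustration, not taken from the paper: given points $X = \{x_1, \dots, x_m\} \subset \mathbb{R}^n$, find $k$ centers $C = \{c_1, \dots, c_k\}$ that minimize the sum of squared Euclidean distances from each point to its nearest center.

```latex
\min_{C = \{c_1, \dots, c_k\} \subset \mathbb{R}^n} \; f(C, X)
  = \sum_{i=1}^{m} \min_{j = 1, \dots, k} \lVert x_i - c_j \rVert^2
```

K-means is a local search for this objective: it alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points, so the result depends on the starting centers.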
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications · Data Stream Mining Techniques
