Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization
Ravil Mussabayev, Rustam Mussabayev

TL;DR
This paper introduces a novel heuristic that combines Variable Neighborhood Search with K-means to significantly improve clustering accuracy and efficiency on large datasets, setting new state-of-the-art results.
Contribution
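The paper presents a new VNS-based heuristic that enhances K-means clustering for big data. It sequentially optimizes partial objective functions, each obtained by restricting the MSSC objective to a random sample of the dataset, and within each one explores systematically expanding neighborhoods of the incumbent solution.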
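For reference, the partial objective can be written in the standard MSSC form. The notation below (X for the full dataset, S for a random sample, C = {c_1, ..., c_k} for the centroids) is assumed here for illustration and is not taken from the paper.

```latex
% Standard MSSC objective over the full dataset X = \{x_1, \dots, x_m\} \subset \mathbb{R}^n
f(C, X) = \sum_{i=1}^{m} \min_{j = 1, \dots, k} \lVert x_i - c_j \rVert^2

% Partial objective: the same sum restricted to a random sample S \subset X
f(C, S) = \sum_{x \in S} \min_{j = 1, \dots, k} \lVert x - c_j \rVert^2
```

Each new sample S defines a different landscape f(·, S) over the same centroid set C, which is what lets the incumbent solution be carried from one sample to the next.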
Findings
Significantly improved clustering accuracy on real-world big datasets.
Enhanced efficiency of K-means in big data environments.
Achieved state-of-the-art performance in large-scale clustering tasks.
Abstract
K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data. Our approach is based on the sequential optimization of the partial objective function landscapes obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to random samples from the original big dataset. Within each landscape, systematically expanding neighborhoods of the currently best (incumbent) solution are explored by reinitializing all degenerate and a varying number of additional centroids. Extensive and rigorous experimentation on a large number of real-world datasets reveals that by transforming the traditional local search into a global one, our algorithm…
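The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameters `sample_size`, `n_samples`, and `p_max`, the uniform-random reinitialization of centroids, and the acceptance rule (comparing the candidate and the incumbent on the same sample) are all hypothetical choices made for this example.

```python
import numpy as np

def assign(X, C):
    """Return nearest-centroid labels and the sum of squared distances."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, d2[np.arange(len(X)), labels].sum()

def kmeans(X, C, max_iter=50):
    """Plain Lloyd iterations started from centroids C (local search)."""
    C = C.copy()
    for _ in range(max_iter):
        labels, _ = assign(X, C)
        new_C = C.copy()
        for j in range(len(C)):
            members = X[labels == j]
            if len(members) > 0:  # empty clusters keep their old centroid
                new_C[j] = members.mean(axis=0)
        if np.allclose(new_C, C):
            break
        C = new_C
    labels, sse = assign(X, C)
    return C, labels, sse

def vns_stream_kmeans(X, k, sample_size, n_samples=10, p_max=3, seed=0):
    """Sketch: VNS-style shaking plus K-means on a sequence of random samples."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]  # initial incumbent
    p = 1  # neighborhood size: number of extra centroids to reinitialize
    for _ in range(n_samples):
        # A fresh random sample defines a new partial objective landscape.
        S = X[rng.choice(len(X), size=sample_size, replace=False)]
        labels, f_inc = assign(S, C)
        counts = np.bincount(labels, minlength=k)
        degenerate = np.flatnonzero(counts == 0)  # centroids with no points
        others = np.flatnonzero(counts > 0)
        extra = rng.choice(others, size=min(p, len(others)), replace=False)
        shake = np.concatenate([degenerate, extra])
        # Shaking: reinitialize all degenerate and p additional centroids
        # (uniformly at random from the sample; an assumption of this sketch).
        cand = C.copy()
        cand[shake] = S[rng.choice(len(S), size=len(shake), replace=False)]
        cand, _, f_cand = kmeans(S, cand)
        if f_cand < f_inc:
            C, p = cand, 1  # accept and return to the smallest neighborhood
        else:
            p = p + 1 if p < p_max else 1  # widen the neighborhood, then wrap
    return C
```

A call such as `vns_stream_kmeans(X, k=3, sample_size=100)` returns a `(k, d)` array of centroids. Note that the candidate is scored after local search while the incumbent is scored as-is on the new sample, which biases acceptance toward the candidate; a more careful variant would refine both before comparing.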
Taxonomy
Topics: Data Stream Mining Techniques · Face and Expression Recognition · Machine Learning and Data Classification
Methods: k-Means Clustering
