Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization

Ravil Mussabayev; Rustam Mussabayev

arXiv:2410.14548 · cs.LG · October 21, 2024

Open Access

TL;DR

This paper introduces a novel heuristic that combines Variable Neighborhood Search with K-means to significantly improve clustering accuracy and efficiency on large datasets, setting new state-of-the-art results.

Contribution

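The paper presents a new VNS-based heuristic that adapts K-means to big data. It sequentially optimizes partial objective functions, each obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to a random sample of the dataset. Within each sample, it explores expanding neighborhoods of the incumbent solution by reinitializing all degenerate centroids plus a varying number of additional ones.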

Findings

01. Significantly improved clustering accuracy on real-world big datasets.

02. Enhanced efficiency of K-means in big data environments.

03. Achieved state-of-the-art performance in large-scale clustering tasks.

Abstract

K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data. Our approach is based on the sequential optimization of the partial objective function landscapes obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to random samples from the original big dataset. Within each landscape, systematically expanding neighborhoods of the currently best (incumbent) solution are explored by reinitializing all degenerate and a varying number of additional centroids. Extensive and rigorous experimentation on a large number of real-world datasets reveals that by transforming the traditional local search into a global one, our algorithm…
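The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the parameters (`sample_size`, `n_samples`, `p_max`), the uniform random reinitialization of centroids, the plain Lloyd iterations used as local search, and the rule of accepting a candidate only when it lowers the MSSC objective on the current sample are all assumptions made for illustration.

```python
import numpy as np


def _sq_dists(X, C):
    # Pairwise squared Euclidean distances, shape (n_points, n_centroids).
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)


def mssc_objective(X, C):
    # MSSC objective: sum of squared distances to the nearest centroid.
    return _sq_dists(X, C).min(axis=1).sum()


def degenerate_mask(X, C):
    # A centroid is degenerate here if no point of X is assigned to it.
    labels = _sq_dists(X, C).argmin(axis=1)
    return np.bincount(labels, minlength=len(C)) == 0


def lloyd(X, C, n_iter=50):
    # Plain K-means (Lloyd) local search; empty clusters keep their centroid.
    C = C.copy()
    for _ in range(n_iter):
        labels = _sq_dists(X, C).argmin(axis=1)
        counts = np.bincount(labels, minlength=len(C))
        sums = np.zeros_like(C)
        np.add.at(sums, labels, X)
        new_C = C.copy()
        filled = counts > 0
        new_C[filled] = sums[filled] / counts[filled, None]
        if np.allclose(new_C, C):
            break
        C = new_C
    return C


def vns_sample_kmeans(X, k, sample_size=256, n_samples=20, p_max=3, seed=0):
    # Hypothetical driver: one VNS run per random sample (partial landscape).
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    best_C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_samples):
        S = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
        best_f = mssc_objective(S, best_C)  # incumbent value on this sample
        p = 0  # number of additional (non-degenerate) centroids to reinitialize
        while p <= p_max:
            C = best_C.copy()
            degenerate = degenerate_mask(S, C)
            healthy = np.flatnonzero(~degenerate)
            extra = rng.choice(healthy, size=min(p, len(healthy)), replace=False)
            redo = np.concatenate([np.flatnonzero(degenerate), extra])
            # Assumption: reinitialize at uniformly drawn sample points.
            C[redo] = S[rng.choice(len(S), size=len(redo), replace=False)]
            C = lloyd(S, C)
            f = mssc_objective(S, C)
            if f < best_f:
                best_C, best_f, p = C, f, 1  # accept, restart from a small neighborhood
            else:
                p += 1  # no improvement: widen the neighborhood
    return best_C
```

A call such as `vns_sample_kmeans(X, k=3)` returns a `(k, d)` array of centroids; `mssc_objective(X, centroids)` then scores them on the full dataset. Each sample is small, so the local search never touches the whole dataset at once, which is the streaming aspect the title refers to.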

Taxonomy

Topics: Data Stream Mining Techniques · Face and Expression Recognition · Machine Learning and Data Classification

Methods: k-Means Clustering