Sketch and Validate for Big Data Clustering
Panagiotis A. Traganitis, Konstantinos Slavakis, Georgios B. Giannakis

TL;DR
This paper introduces the sketch-and-validate (SkeVa) framework for efficient big data clustering: a family of random sampling and validation algorithms, with batch, streaming, and kernel-based variants, designed to improve scalability.
Contribution
It develops a suite of SkeVa algorithms that extend RANSAC ideas to high-dimensional big data clustering with improved efficiency and flexibility.
Findings
Demonstrates competitive performance against state-of-the-art methods
Shows effectiveness on synthetic and real datasets
Provides scalable solutions for high-dimensional clustering
Abstract
In response to the need for learning tools tuned to big data analytics, the present paper introduces a framework for efficient clustering of huge sets of (possibly high-dimensional) data. Building on random sampling and consensus (RANSAC) ideas pursued earlier in a different (computer vision) context for robust regression, a suite of novel dimensionality and set-reduction algorithms is developed. The advocated sketch-and-validate (SkeVa) family includes two algorithms that rely on K-means clustering per iteration on a reduced number of dimensions and/or feature vectors: the first operates in a batch fashion, while the second, sequential one offers computational efficiency and suitability for streaming modes of operation. For clustering even nonlinearly separable vectors, the SkeVa family also offers a member based on user-selected kernel functions. Further trading off performance for…
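The batch idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it only shows the general loop of drawing a small random subset (the sketch), running K-means on it, and scoring the resulting centroids on a second random draw (the validation). The validation score used here, the K-means cost on the validation draw, is an assumption, since this summary does not state the paper's criterion. All function names and parameters (`sketch_and_validate`, `n_draws`, `sketch_size`, `val_size`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means on a small list of points."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new_centroids = []
        for j, group in enumerate(groups):
            if group:
                # Coordinate-wise mean of the points assigned to centroid j.
                new_centroids.append([sum(c) / len(group) for c in zip(*group)])
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def sketch_and_validate(data, k, n_draws=30, sketch_size=40, val_size=80, seed=0):
    """Hypothetical batch loop: keep the sketch whose centroids validate best."""
    rng = random.Random(seed)
    best_centroids, best_cost = None, float("inf")
    for _ in range(n_draws):
        # Sketch: cluster only a small random subset of the data.
        sketch = rng.sample(data, sketch_size)
        centroids = kmeans(sketch, k, rng)
        # Validate: score the centroids on a separate random draw.
        # Assumption: the score is the K-means cost (lower is better).
        validation = rng.sample(data, val_size)
        cost = sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in validation)
        if cost < best_cost:
            best_centroids, best_cost = centroids, cost
    # Final pass: assign every point to its nearest validated centroid.
    labels = [nearest(p, best_centroids) for p in data]
    return best_centroids, labels
```

Calling `sketch_and_validate(data, k=2)` on a list of numeric vectors returns the selected centroids and one label per point. K-means itself only ever runs on `sketch_size` points per draw, which is where the savings over clustering the full data set would come from in this simplified setting.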
Taxonomy
Methods: k-Means Clustering
