Sketch and Validate for Big Data Clustering

Panagiotis A. Traganitis; Konstantinos Slavakis; Georgios B. Giannakis

arXiv:1501.05590 · stat.ML · November 17, 2016

TL;DR

This paper introduces the sketch-and-validate (SkeVa) framework for efficient clustering of big data. Its random-sampling-and-validation algorithms, offered in batch, streaming (sequential), and kernel-based variants, run K-means on reduced sets of dimensions and/or feature vectors to improve scalability.

Contribution

It develops a suite of SkeVa algorithms that extend RANSAC ideas to high-dimensional big data clustering with improved efficiency and flexibility.

Findings

01. Demonstrates competitive performance against state-of-the-art methods
02. Shows effectiveness on synthetic and real datasets
03. Provides scalable solutions for high-dimensional clustering

Abstract

In response to the need for learning tools tuned to big data analytics, the present paper introduces a framework for efficient clustering of huge sets of (possibly high-dimensional) data. Building on random sampling and consensus (RANSAC) ideas pursued earlier in a different (computer vision) context for robust regression, a suite of novel dimensionality and set-reduction algorithms is developed. The advocated sketch-and-validate (SkeVa) family includes two algorithms that rely on K-means clustering per iteration on a reduced number of dimensions and/or feature vectors: the first operates in a batch fashion, while the second, sequential one offers computational efficiency and suitability for streaming modes of operation. For clustering even nonlinearly separable vectors, the SkeVa family also offers a member based on user-selected kernel functions. Further trading off performance for…
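
As a rough illustration of the sketch-and-validate idea described in the abstract, the following Python sketch repeatedly draws a small random subset of the feature vectors, runs K-means on it, scores the resulting centroids, and keeps the best draw. It is not the authors' algorithm. The validation score (average squared distance to the nearest centroid on a second random draw), the function names, and the parameters `n_sketch`, `n_valid`, and `n_draws` are assumptions made for this example. It covers only set reduction, not dimensionality reduction or kernels, and uses only the standard library.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations on a (small) list of points."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def cost(points, centroids):
    """Average squared distance to the nearest centroid (assumed score)."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


def sketch_and_validate(data, k, n_sketch, n_valid, n_draws, seed=0):
    """Hypothetical RANSAC-style loop: sketch, cluster, validate, keep the best."""
    rng = random.Random(seed)
    best_cost, best_centroids = float("inf"), None
    for _ in range(n_draws):
        sketch = rng.sample(data, n_sketch)   # sketch: small random subset
        centroids = kmeans(sketch, k, rng)    # K-means on the sketch only
        valid = rng.sample(data, n_valid)     # validate on a second draw
        c = cost(valid, centroids)
        if c < best_cost:
            best_cost, best_centroids = c, centroids
    # Only the final assignment touches every point in the data set.
    labels = [nearest(p, best_centroids) for p in data]
    return best_centroids, labels


# Usage on synthetic data: two well-separated Gaussian blobs of 300 points each.
gen = random.Random(1)
data = [[gen.gauss(0, 0.5), gen.gauss(0, 0.5)] for _ in range(300)]
data += [[gen.gauss(10, 0.5), gen.gauss(10, 0.5)] for _ in range(300)]
centroids, labels = sketch_and_validate(data, k=2, n_sketch=30, n_valid=60, n_draws=5)
```

In this toy version each K-means run sees only `n_sketch` points, so the cost per draw does not grow with the size of the data set; the full data set is touched once, for the final label assignment.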

Taxonomy

Methods: k-Means Clustering