A sampling-based approach for efficient clustering in large datasets

Georgios Exarchakis; Omar Oubari; Gregor Lenz

arXiv:2112.14793 · cs.LG · March 30, 2022 · 1 citation

TL;DR

This paper introduces a sampling-based clustering method for high-dimensional data with many clusters. It compares each data point with only a subset of the cluster centres, shares its optimal solutions with exact k-means, and is reported to reach them more efficiently.

Contribution

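The proposed algorithm is substantially more efficient than k-means because it avoids an all-to-all comparison of data points and clusters. Its optimal solutions coincide with those of exact k-means, and the authors report that it extracts clusters more efficiently than existing approximation methods.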

Findings

01

Has the same optimal solutions as exact k-means

02

Reduces computational complexity and the number of operations to convergence

03

Demonstrates superior stability and efficiency in clustering tasks

Abstract

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating distances between data points and a subset of the cluster centres. Our method is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as those of the exact solution, while our approach is considerably more efficient at extracting these clusters than the state of the art. We compare our approximation with exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including the number of operations to convergence, and the stability of the results.
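
To make the idea of comparing each data point with only a subset of the cluster centres concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm or the code in their repository: the abstract does not say how the subset is chosen, so uniform random sampling, always keeping the point's current centre in the candidate set, and all function and parameter names (`sampled_kmeans`, `n_candidates`, and so on) are assumptions made purely for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sampled_assignment_step(points, centres, assign, n_candidates, rng):
    """Reassign each point using only a sampled subset of the centres."""
    k = len(centres)
    new_assign = []
    for p, current in zip(points, assign):
        # Assumption: candidates are the current centre plus a uniform
        # random sample, so a point's cost can never increase in this step.
        candidates = {current, *rng.sample(range(k), n_candidates)}
        new_assign.append(min(candidates, key=lambda c: sq_dist(p, centres[c])))
    return new_assign


def update_centres(points, centres, assign):
    """Move each centre to the mean of the points assigned to it."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centres]
    counts = [0] * len(centres)
    for p, c in zip(points, assign):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    # Clusters with no points keep their previous centre.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centres[c])
        for c in range(len(centres))
    ]


def objective(points, centres, assign):
    """k-means objective: summed squared distance to the assigned centre."""
    return sum(sq_dist(p, centres[c]) for p, c in zip(points, assign))


def sampled_kmeans(points, k, n_candidates, n_iter=20, seed=0):
    """Lloyd-style iterations with a sampled assignment step (illustrative)."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    assign = [rng.randrange(k) for _ in points]
    history = []
    for _ in range(n_iter):
        assign = sampled_assignment_step(points, centres, assign, n_candidates, rng)
        centres = update_centres(points, centres, assign)
        history.append(objective(points, centres, assign))
    return centres, assign, history
```

In this sketch each point is compared with at most `n_candidates + 1` centres per iteration instead of all `k`, and the objective cannot increase between iterations because the current centre is always a candidate. Setting `n_candidates = k` recovers the usual all-to-all assignment step of exact k-means.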

Code & Models

Repositories

ooub/peregrine
PyTorch · Official

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Stream Mining Techniques · Face and Expression Recognition