Relational Algorithms for k-means Clustering

Benjamin Moseley; Kirk Pruhs; Alireza Samadian; Yuyan Wang

arXiv:2008.00358 · cs.DS · May 24, 2021

TL;DR

This paper introduces an efficient relational algorithm for k-means clustering that operates directly on a relational database, without first joining it into a single matrix of data points, and combines rejection sampling with k-means++ to achieve a constant-factor approximation.

Contribution

It presents the first efficient relational algorithm for k-means clustering and characterizes the limitations of relational algorithms in clustering tasks.

Findings

01

The proposed algorithm operates directly on the relational database, without performing a join to materialize the data points.

02

Achieves an O(1)-approximate solution using rejection sampling and k-means++.

03

Shows that, given two cluster centers, approximating the number of points assigned to each cluster is NP-hard on general relational inputs.
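
For readers unfamiliar with k-means++, the following is a minimal sketch of its conventional seeding step, where each new center is drawn with probability proportional to its squared distance from the nearest center already chosen. It assumes the data points are available as an explicit list, which is exactly what the relational setting avoids, so it is not the paper's algorithm. The names `sq_dist` and `kmeanspp_seed` are illustrative, not taken from the paper.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """Conventional k-means++ seeding on an explicit (materialized) point list.

    Assumption: `points` fits in memory. The relational algorithm described
    above is designed precisely to avoid this materialization.
    """
    # First center: uniform at random.
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # D^2 sampling: pick index i with probability d2[i] / total.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            nxt = points[idx]
        centers.append(nxt)
        # Update nearest-center distances with the new center.
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeanspp_seed(pts, 2, random.Random(0)))
```

Each round of this explicit version takes time linear in the number of points. Per the summary above, the paper instead relies on rejection sampling to draw from such a distribution without enumerating the joined points; that construction is not reproduced here.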

Abstract

This paper gives a k-means approximation algorithm that is efficient in the relational algorithms model. This is an algorithm that operates directly on a relational database without performing a join to convert it to a matrix whose rows represent the data points. The running time is potentially exponentially smaller than $N$, the number of data points to be clustered that the relational database represents. Few relational algorithms are known and this paper offers techniques for designing relational algorithms as well as characterizing their limitations. We show that given two data points as cluster centers, if we cluster points according to their closest centers, it is NP-Hard to approximate the number of points in the clusters on a general relational input. This is trivial for conventional data inputs and this result exemplifies that standard algorithmic techniques may not be…
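
To make the gap between the stored database and $N$ concrete, here is a small sketch using a hypothetical toy database of three single-column tables with no shared columns, so that their join is a plain cross product. This is the simplest possible case, chosen only to show the size gap; the hardness result concerns general relational inputs, not this one. The sketch also shows why counting cluster sizes is trivial once the points are materialized. The table contents and the names `materialize` and `cluster_sizes` are assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy database: three single-column tables with no shared
# columns, so their join is the full cross product of the columns.
tables = {"x": [0.0, 1.0, 2.0], "y": [0.0, 5.0], "z": [1.0, 3.0]}


def materialize(tables):
    """Naive join for this toy case: every combination of one row per table."""
    return list(product(*tables.values()))


def cluster_sizes(points, c1, c2):
    """Count the points closest to each of two centers (ties go to c1)."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    n1 = sum(1 for p in points if d2(p, c1) <= d2(p, c2))
    return n1, len(points) - n1


points = materialize(tables)
input_size = sum(len(rows) for rows in tables.values())

# The stored input has 3 + 2 + 2 = 7 rows; the join has 3 * 2 * 2 = 12 points.
print(input_size, len(points))
# On materialized points, counting cluster sizes is a single linear scan.
print(cluster_sizes(points, (0.0, 0.0, 1.0), (2.0, 5.0, 3.0)))
```

With d such tables of m rows each, the stored input has d·m rows while the join has m^d points, which is the sense in which a running time bounded by the input size can be exponentially smaller than $N$.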
