Relational Algorithms for k-means Clustering
Benjamin Moseley, Kirk Pruhs, Alireza Samadian, Yuyan Wang

TL;DR
This paper introduces an efficient relational algorithm for k-means clustering that operates directly on a relational database, using rejection sampling and k-means++ to achieve a constant-factor approximation.
Contribution
It presents the first efficient relational algorithm for k-means clustering and characterizes the limitations of relational algorithms in clustering tasks.
Findings
The proposed algorithm operates directly on relational data without joins.
The algorithm achieves an O(1)-approximate solution using rejection sampling and k-means++.
Approximating the number of points in each cluster is NP-hard on general relational inputs, even when only two cluster centers are given.
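For reference, the standard k-means++ seeding step mentioned above can be sketched on an explicit list of points as follows. This is the conventional, non-relational version, shown only to make the sampling rule concrete; it is not the paper's relational algorithm, and the names `sq_dist` and `kmeanspp_seed` are illustrative assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seed(points, k, rng):
    """Pick k initial centers with the standard k-means++ rule.

    The first center is uniform at random; each later center is drawn with
    probability proportional to its squared distance to the nearest center
    chosen so far. Assumes `points` holds at least k distinct points,
    otherwise all weights become zero and rng.choices raises ValueError.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Usage on a small explicit data set (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.0, 0.2)]
centers = kmeanspp_seed(points, 3, random.Random(0))
```

Note that this sketch scans every point in each round, which is exactly the cost a relational algorithm has to avoid when the points exist only implicitly as the result of a join.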
Abstract
This paper gives a k-means approximation algorithm that is efficient in the relational algorithms model. This is an algorithm that operates directly on a relational database without performing a join to convert it to a matrix whose rows represent the data points. The running time is potentially exponentially smaller than the number of data points to be clustered that the relational database represents. Few relational algorithms are known, and this paper offers techniques for designing relational algorithms as well as characterizing their limitations. We show that given two data points as cluster centers, if we cluster points according to their closest centers, it is NP-hard to approximate the number of points in the clusters on a general relational input. This is trivial for conventional data inputs, and this result exemplifies that standard algorithmic techniques may not be…
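To illustrate what it means to work on the tables without performing the join, here is a minimal sketch that counts the rows of a two-table join from per-key counts alone. It is a standard aggregation trick, not the paper's algorithm; the schema `R(a, b)`, `S(b, c)` and the function names are assumptions made for the example.

```python
from collections import Counter

def join_size(R, S):
    """Number of rows in R(a, b) JOIN S(b, c), computed without building the join.

    Each join key b contributes (#rows of R with b) * (#rows of S with b).
    """
    r_counts = Counter(b for _, b in R)
    s_counts = Counter(b for b, _ in S)
    return sum(n * s_counts[b] for b, n in r_counts.items())

def materialized_join(R, S):
    """Explicit join, for comparison only; its size can be far larger than the input."""
    return [(a, b, c) for a, b in R for b2, c in S if b == b2]

# Hypothetical tables: 100 rows each, all sharing one join key.
R = [(i, 0) for i in range(100)]
S = [(0, j) for j in range(100)]
n_rows = join_size(R, S)
```

Here the two input tables hold 200 rows in total while the join holds 10,000, and chaining more tables in the same way multiplies the gap again. The hardness result in the abstract shows that not every quantity about the joined data can be obtained this cheaply.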
