Optimal Time Bounds for Approximate Clustering
Ramgopal Mettu, Greg Plaxton

TL;DR
This paper introduces a new sampling technique called successive sampling for the k-median clustering problem, achieving a tight time complexity of Theta(nk) and providing constant-factor approximation guarantees.
Contribution
The paper presents a simple, efficient sampling method and an algorithm that tightly bounds the time complexity for approximate k-median clustering.
Findings
Successive sampling identifies small representative sets efficiently.
The algorithm runs in O(nk) time for a wide range of k values.
A lower bound for randomized algorithms matches the O(nk) upper bound, so the bound is tight.
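The idea behind successive sampling can be sketched in a few lines. This is a simplified, unweighted illustration and not the authors' exact procedure. The function name `successive_sample`, the constants `alpha` and `beta`, and the use of Euclidean distance are all assumptions made for the example; this summary does not state the paper's constants or analysis. Each round draws a small random sample, discards the fraction of remaining points closest to that sample, and repeats on the rest, so the number of rounds grows roughly like log(n/k).

```python
import math
import random


def successive_sample(points, k, alpha=2.0, beta=0.5, seed=0):
    """Return a small summary of `points` (hypothetical helper, illustrative only).

    alpha: sample size per round is floor(alpha * k)   (assumed constant)
    beta:  fraction of remaining points removed per round (assumed constant)
    """
    rng = random.Random(seed)
    remaining = list(points)
    sample_size = max(1, math.floor(alpha * k))
    summary = []

    while len(remaining) > sample_size:
        # Draw a small sample (with replacement) from the points still in play.
        sample = [rng.choice(remaining) for _ in range(sample_size)]
        summary.extend(sample)

        # Rank remaining points by distance to their nearest sampled point.
        ranked = sorted(
            (min(math.dist(p, s) for s in sample), i)
            for i, p in enumerate(remaining)
        )

        # Discard the closest beta-fraction; the sample already represents them.
        n_remove = max(1, math.ceil(beta * len(remaining)))
        removed = {i for _, i in ranked[:n_remove]}
        remaining = [p for i, p in enumerate(remaining) if i not in removed]

    # Whatever is left is small enough to keep outright.
    summary.extend(remaining)
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(summary))


points = [(float(x), float(y)) for x in range(20) for y in range(10)]
summary = successive_sample(points, k=3)
print(len(points), "points summarized by", len(summary))
```

Because the remaining set shrinks geometrically, the total number of distance evaluations in this sketch is proportional to nk. The sort adds a logarithmic factor per round, which a linear-time selection step would avoid.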
Abstract
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with…
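To make the k-median objective concrete, the following minimal sketch computes the average distance from each point to its nearest center. The function name `kmedian_cost` and the choice of Euclidean distance are assumptions for illustration; the objective itself only requires some distance function.

```python
import math


def kmedian_cost(points, centers):
    """Average distance from each point to its nearest center (illustrative helper)."""
    total = 0.0
    for p in points:
        # Each point is charged the distance to its closest center.
        total += min(math.dist(p, c) for c in centers)
    return total / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two centers, one in the middle of each pair of points.
print(kmedian_cost(points, [(0.0, 1.0), (10.0, 1.0)]))  # prints 1.0

# A single center in the middle of everything gives a much larger cost.
print(kmedian_cost(points, [(5.0, 1.0)]))
```

A k-median algorithm searches for the set of k centers that makes this value as small as possible; a constant-factor approximation returns centers whose cost is within a fixed multiple of that minimum.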
Taxonomy
Topics: Machine Learning and Algorithms · Complexity and Algorithms in Graphs · Bayesian Methods and Mixture Models
