Algorithms for finding $k$ in $k$-means
Chiranjib Bhattacharyya, Ravindran Kannan, Amit Kumar

TL;DR
This paper introduces a data-driven method to determine the number of clusters in $k$-means clustering, providing the first polynomial-time algorithm under natural assumptions, applicable to Gaussian and sub-Gaussian mixtures.
Contribution
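It proposes a deterministic framework built on two assumptions: an unknown Ground Truth (GT) clustering with separated cluster centers, and a novel no tight sub-cluster (NTSC) condition. Under these assumptions it gives a data-determined definition of $k$ and a polynomial-time algorithm that identifies $k$ without prior knowledge.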
Findings
The algorithm successfully identifies $k$ on synthetic and real data.
Applicable to Gaussian and sub-Gaussian mixture models.
Provides theoretical guarantees under natural assumptions.
Abstract
$k$-means clustering requires as input the exact value of $k$, the number of clusters. Two challenges are open: (i) Is there a data-determined definition of $k$ which is provably correct, and (ii) is there a polynomial-time algorithm to find $k$ from data? This paper provides the first affirmative answers to both questions. As is common in the literature, we assume that the data admits an unknown Ground Truth (GT) clustering with cluster centers separated. This assumption alone is not sufficient to answer yes to (i). We assume a novel but natural second constraint called no tight sub-cluster (NTSC), which stipulates that no substantially large subset of a GT cluster can be "tighter" (in a sense we define) than the cluster. Our affirmative answers to (i) and (ii) hold under these two deterministic assumptions. We also give a polynomial-time algorithm to identify $k$. Our algorithm relies on NTSC…
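To make the motivating problem concrete, here is a minimal sketch of standard Lloyd-style $k$-means in plain Python. It is not the paper's algorithm and does not implement the GT or NTSC assumptions. It only illustrates that $k$ must be supplied by the caller. The function names (`lloyd_kmeans`, `farthest_first_init`), the farthest-first seeding, and the synthetic three-blob data are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def farthest_first_init(points, k):
    """Deterministic seeding (an illustrative choice): start at points[0],
    then repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def lloyd_kmeans(points, k, iters=50):
    """Standard Lloyd iterations. Note that k is a required input."""
    centers = farthest_first_init(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, cost


# Hypothetical synthetic data: three well-separated Gaussian blobs in 2D.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in true_centers
    for _ in range(50)
]

# The caller has to guess k; here we simply try several values.
costs = {k: lloyd_kmeans(points, k)[2] for k in range(1, 7)}
for k, cost in costs.items():
    print(k, round(cost, 1))
```

Because the optimal $k$-means cost never increases as $k$ grows, minimizing the cost alone cannot select $k$. That is why a separate, data-determined definition of $k$, such as the one the abstract describes, is needed.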
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Bayesian Methods and Mixture Models
