k is the Magic Number -- Inferring the Number of Clusters Through Nonparametric Concentration Inequalities
Sibylle Hess, Wouter Duivesteijn

TL;DR
This paper introduces a statistically grounded method to determine the number of clusters in data. It applies across clustering algorithms and makes no distributional assumptions, working by testing whether two clusters likely originate from the same distribution.
Contribution
Proposes a new nonparametric bound for inferring the number of clusters, enabling automatic determination of k without data transformation or distribution assumptions.
Findings
Method effectively determines the number of clusters in nonconvex data.
Algorithm can identify single-cluster datasets.
Applicable as a wrapper to existing clustering algorithms.
Abstract
Most convex and nonconvex clustering algorithms come with one crucial parameter: the k in k-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the…
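The wrapper idea can be illustrated with a minimal Python sketch. This is not the authors' algorithm or bound: the function `infer_k`, the stopping rule (increase k until some pair of clusters plausibly stems from one distribution), and the toy helpers `gap_cluster` and `close_enough` are all hypothetical stand-ins chosen for illustration. In practice the `plausibly_one` callback would be replaced by a statistical test such as the paper's bound, and `cluster` by any existing clustering algorithm.

```python
from itertools import combinations
from typing import Callable

import numpy as np

# Hypothetical interfaces (assumptions, not from the paper):
# a clusterer maps (data, k) to integer labels in 0..k-1;
# a merge test says whether two clusters plausibly share one distribution.
Clusterer = Callable[[np.ndarray, int], np.ndarray]
MergeTest = Callable[[np.ndarray, np.ndarray], bool]


def infer_k(X: np.ndarray, cluster: Clusterer, plausibly_one: MergeTest,
            k_max: int = 10) -> int:
    """Return the largest k (tried in increasing order) for which no pair
    of clusters looks like a split of a single distribution."""
    best = 1  # fall back to one cluster if even k = 2 over-splits
    for k in range(2, k_max + 1):
        labels = cluster(X, k)
        groups = [X[labels == j] for j in range(k)]
        groups = [g for g in groups if len(g) > 0]
        if len(groups) < k:
            break  # the clusterer could not produce k non-empty clusters
        if any(plausibly_one(a, b) for a, b in combinations(groups, 2)):
            break  # some pair should be merged, so k is too large
        best = k
    return best


def gap_cluster(X: np.ndarray, k: int) -> np.ndarray:
    """Toy 1-D clusterer (stand-in): cut at the k-1 largest gaps."""
    x = X[:, 0]
    order = np.argsort(x, kind="stable")
    gaps = np.diff(x[order])
    cuts = np.sort(np.argsort(gaps, kind="stable")[::-1][:k - 1])
    labels_sorted = np.zeros(len(x), dtype=int)
    for c in cuts:
        labels_sorted[c + 1:] += 1  # every point right of a cut moves up
    labels = np.empty(len(x), dtype=int)
    labels[order] = labels_sorted
    return labels


def close_enough(a: np.ndarray, b: np.ndarray, tol: float = 1.0) -> bool:
    """Toy merge test (stand-in, NOT a statistical bound): two clusters
    count as one if their closest points are within tol."""
    return bool(np.min(np.abs(a[:, None, 0] - b[None, :, 0])) < tol)
```

Calling `infer_k(X, gap_cluster, close_enough)` on three well-separated one-dimensional groups stops at k = 3, and on a single tight group it returns 1, mirroring the single-cluster case listed under Findings. Keeping the clusterer and the merge test as separate callbacks reflects the wrapper design: neither needs to know about the other.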
Taxonomy
Topics: Face and Expression Recognition · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models
Methods: Spectral Clustering
