Penalized k-means algorithms for finding the correct number of clusters in a dataset
Behzad Kamgar-Parsi, Behrooz Kamgar-Parsi

TL;DR
This paper analyzes penalized k-means algorithms for determining the correct number of clusters, deriving bounds for additive penalties under ideal conditions and proposing a multiplicative penalty as a more robust alternative, supported by empirical results.
Contribution
It provides theoretical bounds for additive penalties in ideal clusters and introduces a parameter-free multiplicative penalty method with stronger cluster detection signatures.
Findings
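The bounds on the additive penalty coefficient depend on the true number of clusters, which is the quantity being sought.
Even for ideal clusters, the additive penalty typically yields a weak and often ambiguous signature for the cluster count.
The multiplicative penalty offers a clearer, parameter-free alternative.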
Abstract
In many applications we want to find the number of clusters in a dataset. A common approach is to use the penalized k-means algorithm with an additive penalty term linear in the number of clusters. An open problem is estimating the value of the coefficient of the penalty term. Since estimating the value of the coefficient in a principled manner appears to be intractable for general clusters, we investigate "ideal clusters", i.e. identical spherical clusters with no overlaps and no outlier background noise. In this paper: (a) We derive, for the case of ideal clusters, rigorous bounds for the coefficient of the additive penalty. Unsurprisingly, the bounds depend on the correct number of clusters, which we want to find in the first place. We further show that additive penalty, even for this simplest case of ideal clusters, typically produces a weak and often ambiguous signature for the…
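As a rough illustration of the additive-penalty approach the abstract describes, the sketch below picks the k that minimizes SSE(k) + λ·k on a toy set of three identical, well-separated clusters. Everything in it is an assumption made for illustration rather than the authors' procedure: the function names `kmeans_sse` and `select_k`, the deterministic farthest-point seeding, the toy data, and the value of `lam` (the coefficient λ). It does not implement the paper's bounds or its multiplicative penalty.

```python
from math import dist

def kmeans_sse(points, k, iters=100):
    """Plain Lloyd's k-means; returns the sum of squared errors (SSE).

    Seeding is deterministic farthest-point selection (an assumption of
    this sketch, chosen so the example is reproducible).
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep it if empty).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)

def select_k(points, lam, k_max):
    """Return the k in 1..k_max minimizing SSE(k) + lam * k, plus all scores."""
    scores = {k: kmeans_sse(points, k) + lam * k for k in range(1, k_max + 1)}
    return min(scores, key=scores.get), scores

# Toy "ideal" data: three identical clusters, no overlap, no background noise.
offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
true_centers = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for cx, cy in true_centers for dx, dy in offsets]

best_k, scores = select_k(points, lam=50.0, k_max=6)
print(best_k)  # 3
```

On this toy data the selected k is sensitive to the coefficient: `lam=50.0` recovers three clusters, while `lam=0.5` selects more than three and `lam=300.0` selects fewer. The workable range here is a property of this particular dataset, which is one way to see why choosing the coefficient without knowing the cluster structure is difficult.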
Taxonomy
Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Statistical Methods and Inference
