Penalized k-means algorithms for finding the correct number of clusters in a dataset

Behzad Kamgar-Parsi; Behrooz Kamgar-Parsi

arXiv:1911.06741·cs.LG·November 18, 2019·1 cites

TL;DR

This paper analyzes penalized k-means algorithms for determining the correct number of clusters, deriving bounds for additive penalties under ideal conditions and proposing a multiplicative penalty as a more robust alternative, supported by empirical results.

Contribution

The paper derives theoretical bounds on the additive-penalty coefficient for ideal clusters and introduces a parameter-free multiplicative penalty that gives a stronger signature for the correct number of clusters.

Findings

01

Additive penalty bounds depend on the true number of clusters

02

Additive penalty often yields ambiguous signatures for cluster count

03

Multiplicative penalty offers a clearer, parameter-free alternative

Abstract

In many applications we want to find the number of clusters in a dataset. A common approach is to use the penalized k-means algorithm with an additive penalty term linear in the number of clusters. An open problem is estimating the value of the coefficient of the penalty term. Since estimating the value of the coefficient in a principled manner appears to be intractable for general clusters, we investigate "ideal clusters", i.e. identical spherical clusters with no overlaps and no outlier background noise. In this paper: (a) We derive, for the case of ideal clusters, rigorous bounds for the coefficient of the additive penalty. Unsurprisingly, the bounds depend on the correct number of clusters, which we want to find in the first place. We further show that additive penalty, even for this simplest case of ideal clusters, typically produces a weak and often ambiguous signature for the…
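The additive-penalty approach described in the abstract can be illustrated with a small sketch: run k-means for each candidate k, record the sum of squared errors, add a penalty linear in k, and pick the k with the lowest total. This is not the authors' code. The function names (`make_ideal_clusters`, `kmeans_sse`, `select_k`), the coefficient name `lam`, the farthest-point initialization, and the toy data (three identical, non-overlapping disks) are all assumptions made for illustration.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def make_ideal_clusters(centers, n_per_cluster=30, radius=1.0, seed=0):
    """Toy 'ideal' data (assumption): identical disks, uniformly filled, no noise."""
    rng = random.Random(seed)
    points = []
    for cx, cy in centers:
        for _ in range(n_per_cluster):
            r = radius * math.sqrt(rng.random())      # uniform over the disk area
            theta = 2.0 * math.pi * rng.random()
            points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points


def kmeans_sse(points, k, n_iter=50):
    """Lloyd's k-means; returns the sum of squared errors for k clusters.

    Farthest-point initialization is an illustrative choice, not taken
    from the paper.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to its group's mean (keep it if the group is empty).
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def select_k(points, k_max, lam):
    """Additive penalty: choose k minimizing SSE(k) + lam * k."""
    scores = {k: kmeans_sse(points, k) + lam * k for k in range(1, k_max + 1)}
    return min(scores, key=scores.get), scores


points = make_ideal_clusters([(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)])
best_k, scores = select_k(points, k_max=6, lam=100.0)
print(best_k, scores)
```

On this toy data a coefficient of 100 selects three clusters, while a very large coefficient collapses the choice to a single cluster. The result depends on picking `lam` in a suitable range, which is the difficulty the abstract points to.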

Taxonomy

Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Statistical Methods and Inference