TL;DR
This paper derives a Bayesian Information Criterion (BIC) tailored to estimating the number of clusters in a data set. The derivation accounts for the structure of the clustering problem, and the resulting criterion drives a two-step cluster enumeration algorithm that is tested on synthetic and real data sets.
Contribution
The paper derives a new BIC for clustering that incorporates the data structure of the clustering problem, and it provides a practical two-step algorithm for cluster enumeration.
Findings
The new BIC improves cluster number estimation accuracy.
The proposed algorithm performs well on synthetic and real data.
Incorporating the data structure of the clustering problem into the BIC derivation yields a penalty term that differs from that of the original BIC.
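For reference, the original (Schwarz) BIC that the last finding compares against can be written in its maximization form as below. The notation is an assumption made here for illustration: M_l is a candidate model, \hat{\theta}_l its maximum-likelihood parameter estimate, q_l its number of free parameters, and N the number of observations. The paper's own penalty term is not reproduced in this summary, so only the classical form is shown.

```latex
\mathrm{BIC}(M_l) = \log \mathcal{L}\bigl(\hat{\theta}_l \mid \mathcal{X}\bigr) - \frac{q_l}{2}\,\log N
```

The second term is the penalty: it grows with the number of free parameters q_l and with log N, and the candidate with the largest criterion value is preferred.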
Abstract
We derive a new Bayesian Information Criterion (BIC) by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Given that some mild assumptions are satisfied, we provide a general BIC expression for a broad class of data distributions. This serves as a starting point when deriving the BIC for specific distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed variables. We show that incorporating the data structure of the clustering problem into the derivation of the BIC results in an expression whose penalty term is different from that of the original BIC. We propose a two-step cluster enumeration algorithm. First, a model-based unsupervised learning algorithm partitions the data according to a given set of candidate models.…
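The two-step idea can be illustrated with a minimal sketch, under several loudly stated assumptions. The abstract is cut off after the first step, so the sketch assumes the second step picks the candidate with the largest criterion value, which matches the posterior-maximization framing. It uses 1-D data, k-means as a stand-in for the model-based partitioning step, and the classical Schwarz BIC as a stand-in for the paper's criterion, whose closed form is not given here. All function names (`kmeans_1d`, `classic_bic`, `enumerate_clusters`) are hypothetical.

```python
import math
from statistics import NormalDist


def kmeans_1d(xs, k, iters=100):
    """Step 1 stand-in: partition 1-D data into k clusters (Lloyd's k-means).

    Assumption: the paper uses a model-based algorithm; k-means is a simpler
    substitute chosen to keep the sketch short.
    """
    n = len(xs)
    order = sorted(xs)
    # Deterministic initialization at evenly spaced quantiles of the data.
    centers = [order[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


def classic_bic(xs, labels, k):
    """Step 2 stand-in: classical BIC, log-likelihood minus (q / 2) * log N.

    Each cluster is modeled as a 1-D Gaussian with its own mean and variance,
    weighted by its share of the data. This is NOT the paper's new criterion.
    """
    n = len(xs)
    log_lik = 0.0
    for j in range(k):
        members = [x for x, lab in zip(xs, labels) if lab == j]
        n_j = len(members)
        if n_j == 0:
            continue  # empty clusters contribute nothing
        mean = sum(members) / n_j
        # Floor the variance so tiny clusters do not produce log(0).
        var = max(sum((x - mean) ** 2 for x in members) / n_j, 1e-6)
        log_lik += n_j * math.log(n_j / n)
        log_lik -= 0.5 * n_j * (math.log(2 * math.pi * var) + 1.0)
    q = 3 * k - 1  # k means + k variances + (k - 1) free mixing weights
    return log_lik - 0.5 * q * math.log(n)


def enumerate_clusters(xs, candidates):
    """Score every candidate cluster count and return the best one."""
    scores = {k: classic_bic(xs, kmeans_1d(xs, k), k) for k in candidates}
    return max(scores, key=scores.get), scores


# Synthetic data: three well-separated groups built from Gaussian quantiles,
# so the example is deterministic and needs no random number generator.
offsets = [NormalDist().inv_cdf((i + 0.5) / 40) for i in range(40)]
data = [center + o for center in (0.0, 8.0, 20.0) for o in offsets]

best_k, scores = enumerate_clusters(data, range(1, 7))
print("estimated number of clusters:", best_k)
for k in sorted(scores):
    print(k, round(scores[k], 2))
```

Swapping `classic_bic` for a different criterion only changes the scoring function; the partition-then-score loop in `enumerate_clusters` stays the same.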
