A score function for Bayesian cluster analysis
John Noble, {\L}ukasz Rajkowski

TL;DR
This paper introduces a parameter-free score function for Bayesian clustering that balances within-cluster variance and between-cluster entropy, aiding in selecting the optimal number of clusters.
Contribution
The proposed score function is a novel, parameter-free tool for Bayesian clustering that improves cluster number selection in existing methods.
Findings
Effective in choosing the number of clusters
Balances variance and entropy considerations
Applicable to hierarchical and K-means clustering
Abstract
We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within cluster variance and the between cluster entropy of a clustering. It can be used to choose the number of clusters in well-established clustering methods such as hierarchical clustering or -means algorithm.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Data Management and Algorithms
A score function for Bayesian cluster analysis
John Noble
Łukasz Rajkowski
Abstract
We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within cluster variance and the between cluster entropy of a clustering. It can be used to choose the number of clusters in well-established clustering methods such as hierarchical clustering or -means algorithm.
1 Introduction
Many clustering methods generate a family of clusterings that depend on some user-defined parameters. The most prominent example is the -means algorithm, where the investigator has to specify the number of clusters. Similarly, in hierarchical clustering, a whole family of clusterings is obtained, starting from the finest partition into singletons and ending in the coarsest clustering, i.e. a single cluster. Again, the investigator chooses the number of clusters based on the dendrogram.
All these methods come with a variety of suggestions how to choose the optimal number of clusters. Some of these are rather heuristic in nature, while others have deep theoretical foundations. For the -means algorithm these include the elbow method or average silhouette method (Rousseeuw (1987)). Another solution is to use a score statistic (a function which is intended to measure the quality of a clustering) and among different clusterings proposed by a given method choose the one that maximises the score statistic. Constructing score statistics is not a trivial task; one of the most popular choices is the gap statistic (Tibshirani et al. (2001)).
In this article we propose a new score statistic. It is derived as a limit of the first order approximation to the posterior probability (up to the norming constant) in a Nonparametric Bayesian Mixture Model with the inverse Wishart distribution as a base measure for the within group covariance matrices and the Gaussian distribution as a base measure for the cluster means and the component measure. In order to derive the limit we assume that the data is an independent sample from some ‘input’ probability distribution on the observation space; this gives a method of assessing the compatibility of the partitions of the observation space to the input distribution. The score function is obtained by taking the empirical measure as the input distribution and tweaking it slightly so that it is well defined on all possible data clusterings.
1.1 Contribution and Results
Our main contribution is the formulation of a novel score function for clusterings, which is motivated theoretically and performs well on analysed datasets. Suppose that we have a sequence of observations and we believe that it consists of several groups and within every group the data is distributed according to some Gaussian distribution (with unknown mean and covariance matrix). The goal is to construct a simple function that measures how well a given clustering of the dataset corresponds to the assumption of being Gaussially distributed within clusters. Our proposition is the following: for we define and and for the notational simplicity denote . For and – a partition of let
[TABLE]
It should be noted that if is a realisation of a random independent sample from some distribution on , then the components of the formula (1.1) can be treated as empirical estimates of relevant probabilities or the conditional covariance matrices. This is actually how (1.1) is obtained; we investigate the details in Section 3. This remark may be also convenient when dealing with large datasets where the exact computation of (1.1) could be time consuming. In such case we can approximate the variance components of (1.1) by using the random samples from clusters.
2 Score functions and the main formula
2.1 Basic definitions
We start our presentation with a formal definition of a score function, intended to measure the quality of the data clustering.
Notation**.**
For let and let be the set of all partitions of .
Let be the observation space. Let be the set of all possible finite sequences of observations and their partitions and let .
Definition**.**
A clustering score function is any function .
Definition**.**
Let be a score function and let be a family of functions from to . We say that is robust to if for every and and every we have if and only if , where f(\mathbf{x})=\big{(}f(x_{1}),\ldots,f(x_{n})\big{)}.
Hence robustness to means that if we apply any function to all observations, the optimal clustering indicated by the score function will not alter. If no prior knowledge about the clustering structure is available, a natural demand from a score function is to be robust to linear isomorphisms of . In particular, it should be robust to scaling of the axes since it would be strange if the result of applying the score function would depend on the units used to measure the observation. For the similar reasons, we expect a good score function to be robust to translations.
Note, that on the other hand the robustness to all linear transformation would be undesirable – in particular, moving all points to the origin is a linear transformation and we do not expect any clusters to be seen after applying it.
Notation**.**
Let and be two partitions of the same set. We say that is finer than if for every there exist such that . Equivalently, we say that is coarser than and we write .
Definition**.**
Let be a clustering score function. We say that it is non-increasing if for every and such that we have . If is non-increasing then is non-decreasing.
Clearly, no non-decreasing score function would be good for clustering purposes as it would assign the highest score to the clustering into one full cluster, regardless of the data. Similarly, a non-increasing function gives the highest score to the partition of singletons. It seems desirable for this two tendencies to interplay and it is theoretically appealing to find increasing and decreasing parts in a given score function.
2.2 Properties of the score function
Notation**.**
To facilitate the notation in the remaining part of the text we use to denote the determinant of a square matrix .
Definition**.**
With the notation presented in Section 1.1 we define
[TABLE]
and then (which is equivalent to (1.1)). Moreover, we use to denote with being a matrix of zeroes.
Property 1**.**
Let such that span . Let . Then for any .
Proof.
For any
[TABLE]
and hence is non-negative definite. Moreover, it follows from the assumptions that is positive definite. A sum of non-negative and positive definite matrix is positive definite, so its determinant is positive. Therefore all the summands in (1.1) are finite and the proof follows. ∎
Property 2**.**
The score function is robust to translations and linear isomorphisms.
Proof.
It is easy to check that for any , and any translation we have \mathcal{D}(\mathbf{x},\mathcal{I})=\mathcal{D}\big{(}T(\mathbf{x}),\mathcal{I}\big{)} and hence robustness to translations.
Let be a linear automorphism, defined by , where is a invertible matrix. Then
[TABLE]
which clearly implies robustness to linear isomorphisms. ∎
Property 3**.**
- (a)
* is increasing* 2. (b)
-\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big{|}\hat{\mathbf{V}}_{\mathbf{x}}(I)\Big{|}* is decreasing* 3. (c)
-\sum_{I\in\mathcal{I}}\frac{|I|}{n}\ln\Big{|}\frac{\Sigma}{|I|}\Big{|}* is increasing*
Proof.
The proof of parts (a) and (b) follow from 6 by taking the empirical measure instead of . Part (c) follows from (a) because
[TABLE]
∎
3 The derivation
In this section we give the theoretical foundations for considering the function as clustering score function. We present a general formulation of a Bayesian Mixture Model and then we concentrate on the case where the data within clusters are distributed as Gaussians.
We analyse the asymptotics of the formula for the (unnormalised) posterior in this model. In this way we concentrate on scoring the partitions of the observation space rather than the data themselves. However, it is easy to switch to the score statistic by considering an empirical counterpart of instead of ; this yields (cf. (2.1)). The general form of (2.1) is constructed to prevent the function from assigning an infinite score to clusterings with very small clusters (of size less than the dimension of the observation space); on the other hand when the clusters are large enough, then approximates .
3.1 Bayesian Mixture Models
Let be the parameter space and be a family of probability measures on the observation space . Consider a prior distribution on . Let be a probability distribution on the -dimensional simplex \Delta^{m}=\{\bm{p}=(p_{i})_{i=1}^{m}\colon\textrm{\sum_{i=1}^{m}p_{i}=1p_{i}\geq 0i\leq m}\} (where ). Let
[TABLE]
This is a Bayesian Mixture Model. If a Gaussian distribution for all , we say that (3.1) defines a Bayesian Mixture of Gaussians. In this case a convenient choice of the parameter space is , where is the space of positive definite matrices. Then for the distribution is the multivariate normal distribution . A conjugate prior distribution on is the Normal-inverse-Wishart distribution, which is given by
[TABLE]
Here denotes the inverse Wishart distribution and the hyperparameters are , and . This prior is listed in Gelman et al. (2013) with a slightly different hyperparameters, but we made this modification to obtain
[TABLE]
which gives a nice interpretation of the hyperparameters.
Formula (3.1) can model data clustering; clusters are defined by deciding which generated a given data point. In order to formally define the clusters, we need to rewrite (3.1) as
[TABLE]
Then the clusters are the classes of abstraction of the equivalence relation . In this way the distribution on the dimensional simplex generates a probability distribution on the partitions of set into at most subsets.
Example 3.1**.**
Let , , for . Let to be the distribution of . The probability on the space of partitions of that generates is the Generalized Polya Urn Scheme (Blackwell et al. (1973)) also known as the Chinese Restaurant Process (Aldous (1985)) with the probability weight given by
[TABLE]
where .
Lemma 3.2**.**
Let be a probability distribution on that generates a probability on the partitions of . Then for every partition of
[TABLE]
where the ,,middle sum” ranges over all injective functions from to (with the convention ).
Proof.
If then both sides of (3.6) are 0. We now assume that . Let us go back to (3.4) and suppose that the weights and the atoms are fixed. We need to know what is the probability that induces a partition . This would mean that for every all the values for are equal to for some ; let . The values must be different for different , otherwise would not be generated. The probability of the sequence where for is equal to . Since any assignment of clusters to atoms is valid, so for fixed the probability of is equal to . Since is random, we have to integrate it out and (3.6) follows. ∎
Let be the probability distribution on the space of partitions generated by . We can formulate (3.1) as follows: firstly we generate the partition of observations into clusters, and then for every cluster we sample actual observations from the relevant marginal distribution. Formally, (3.1) is equivalent to
[TABLE]
where for , and , is the marginal density of , i.e.
[TABLE]
( is the density of ). We stress the fact that the independent sampling on the ‘lower’ level of (3.7) relates to the independence between clusters (conditioned on the random partition); within one cluster the observations are (marginally) dependent. To make the notation more concise we define
[TABLE]
Then (3.7) becomes
[TABLE]
The further analysis requires the exact formula for ; in our case it is straightforward to compute since and are conjugate. We state the result here for the reader’s convenience.
Proposition 1**.**
Let have the distribution given by (3.2) and let . Then the marginal distribution of is given by
[TABLE]
where is the multivariate Gamma function and
[TABLE]
[TABLE]
Proof.
The proof follows from Murphy (2007), equation (266). ∎
3.2 The Induced Partition
Throughout this section is some fixed probability distribution on .
Definition 3.3**.**
We say that a family of -measurable subsets of is a -partition if
- •
- •
for all , .
Notation**.**
Let be a -partition of the observation space. Let and for let where (if , we do not include it in ). We say that is induced by .
Proposition 2**.**
Let be a -partition of the observation space. Then is almost surely a partition of .
Proof.
The proof is straightforward and therefore omitted. ∎
Let and , where . That means is the conditional expected value and is the conditional covariance matrix of conditioned on the event . For a family of sets with positive measure let
[TABLE]
where means determinant. Let
[TABLE]
It turns out that basically (3.15) is (modulo constant) the first order approximation to the logarithm of the posterior probability in Bayesian Mixture Model of the data clustering defined by , when the data comes as an iid sample from .
Proposition 3**.**
, where
[TABLE]
Proof.
The result follows from 4 and 5. ∎
It should be noted that 3 does not depend on the form of the prior on probability measures. This prior is responsible for the ‘entropy‘ part of (3.16).
The final goal is not to score the partitions of the observation space but clusterings of the data. A natural idea is to replace the distribution in (3.15) by its empirical counterpart. Let be the empirical probability of . This is how is obtained.
The function would not be a good score statistic, because if contains a cluster of size less than then is singular and hence . To circumvent this, one could add some positive definite matrix to the within-group covariance matrix – in this way the relevant determinant will always be greater than zero. Since we would like to avoid any arbitrary constants in the score function, a natural idea is to use the covariance matrix of the whole dataset, . This operation is also motivated by considering the adaptive model, where the strength of prior distribution is increasing linearly with the number of observations. The details of this approach are given in Section 4. On the other hand, we do not want this modification to affect significantly when the sizes of clusters are large and the empirical covariance matrices are good estimates of theoretical ones. Therefore we decide to decrease the importance of the modification linearly with the cluster size. This gives (1.1), which is a well defined score statistic.
3.2.1 Auxiliary propositions
Proposition 4**.**
Let be a probability distribution on and let be a finite -partition of the observation space. Then
Before we present the proof of 4, we formulate an auxiliary lemma that concerns the asymptotics of the function .
Notation**.**
If and are real sequences, we write if . We write if . Similarly, if are real functions, we write if and a(x)=o\big{(}b(x)\big{)} if .
Lemma 3.4**.**
Let . If and then .
Proof.
For sufficiently large we have and , hence
[TABLE]
Left- and right-hand side of (3.17) converge to 1, so . The proof follows from .
∎
Lemma 3.5**.**
If and x_{n}/n-\lambda=o\big{(}\frac{1}{n^{a}}\big{)} for some then .
Proof.
Recall Stirling’s formula: It follows from 3.4 that
[TABLE]
since . Note that for fixed we have and as a result
[TABLE]
∎
Proof of 4.
Note that is a random variable with distribution for all . Due to Law of Iterated Logarithm we have that almost surely \big{(}|J^{A}_{n}|/n-P(A)\big{)}=o(n^{-1/2+\varepsilon}) for any and hence the assumptions of 3.5 are almost surely satisfied, so
[TABLE]
Because is finite and , it means that
[TABLE]
By the strong law of large numbers we have that
[TABLE]
and hence, by (3.13), for
[TABLE]
Hence . Using the Law of Iterated Logarithm and 3.4 again we get
[TABLE]
which means
[TABLE]
and therefore
[TABLE]
∎
Proposition 5**.**
Let be a probability distribution on and let be a finite -partition of the observation space. Let be a probability distribution on the partitions of , generated by the probability distribution on . Then .
Proof.
The proof is a direct consequence of the Law of Large Numbers and 3.8. ∎
By , consists of two components: and . These two behave differently when two clusters are joined; the variance component is increasing whereas the entropy component is decreasing.
Proposition 6**.**
Let be a partition of and let . Let be a partition obtained from by joining and , i.e. . Then
- (a)
** 2. (b)
.
Proof.
Let .
Part (a):
[TABLE]
and the proof follows. The last inequality in (3.27) comes from .
Lemma 3.6**.**
Let . Then
[TABLE]
where is the Löwner partial order, i.e. iff is non-negative definite.
Proof.
Let and where . Then
[TABLE]
Note that the functions are additive, hence
[TABLE]
The last matrix in (3.30) is clearly non-negative definite and the proof follows. ∎
Theorem 3.7**.**
(Theorem 2.4.4 in Horn et al. (1990))* The function is convex on the space of positive definite matrices.*
Proof of part (b):
[TABLE]
and the proof follows. ∎
Theorem 3.8**.**
Let be a probability distribution on the partitions of , generated by the probability distribution on . Fix and consider a sequence of partitions , where is a partition of (it is possible that for some ). Assume that for . Then
[TABLE]
Proof.
Firstly note that for sufficiently large we have for all . Then in (3.6) we sum functions that depend on exactly coordinates of . Hence we can express in the form of an integral on the -dimensional set as
[TABLE]
where is a measure on defined by
[TABLE]
for , where . Hence
[TABLE]
where and is the norm in space.
Since is not a finite measure on , in the remaining part of the proof we will have to be careful that the functions we are considering belong to the space for sufficiently large .
Let and let . Note that
[TABLE]
Moreover for we have and therefore for . Because is bounded by 1 we get
[TABLE]
(the fact that follows easily from applying the Lagrange multipliers).
We now prove that . It is not a straightforward consequence of the pointwise convergence of to since is not a finite measure on .
Clearly, and hence on .
Let be chosen so that for we have and for . Then for
[TABLE]
hence . The result follows from the triangle inequality
[TABLE]
∎
Lemma 3.9**.**
Let for and . Let . Then .
Proof.
As for , the function is continuous and, because is compact in , it achieves its extreme values. Let satisfy . Clearly, . Indeed, otherwise , and , which contradicts the definition of . Since is nonnegative on and it is equal to 0 on the boundary of , we know that is in the interior of . The function is positive on the interior of , so by considering the function and using the Lagrange multipliers, we gat that satisfies
[TABLE]
for and some . Hence ’s are proportional to ’s, and because , we get that and the proof follows. ∎
4 Adaptive model
We now allow parameters of the model (3.2) to change with the number of observations. More precisely, we perform a substitution so that the expected value of the within group precision matrix is fixed and increasingly concentrated on . We investigate the limit formula for the posterior as goes to infinity. Note that in this case .
[TABLE]
Proposition 7**.**
Let be a probability distribution on and let be a finite -partition of the observation space. Then
[TABLE]
Proof.
Note that is a random variable with distribution for all . Due to Law of Iterated Logarithm we have that almost surely \big{(}|J^{A}_{n}|/n-P(A)\big{)}=o(n^{-1/2+\varepsilon}) for any and hence the assumptions of 3.5 are almost surely satisfied, so
[TABLE]
Because is finite and , it means that
[TABLE]
By the strong law of large numbers we have that
[TABLE]
and hence, by (3.13), for
[TABLE]
Hence |\Sigma(\mathbf{X}_{J^{\mathcal{A}}_{n}})|\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\approx}}n^{d}\big{(}P(A)+\lambda\big{)}^{d}|\frac{\lambda}{P(A)+\lambda}\Sigma_{0}+\frac{P(A)}{P(A)+\lambda}\mathbf{V}_{P}(A)|. Using the Law of Iterated Logarithm and 3.4 again we get
[TABLE]
and (4.2) follows. ∎
5 Discussion
In this article we proposed a score function that can be used for choosing the number of clusters in popular clustering methods. It is derived as a limit in a Bayesian Mixture Model of Gaussians. We derived some of its properties, though there are some questions that remain unanswered. For example, it is interesting to ask what assumptions on should be made to ensure that the supremum of possible values of the function is finite.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Aldous [1985] David J Aldous. Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII—1983 , pages 1–198. Springer, 1985.
- 2Blackwell et al. [1973] David Blackwell, James B Mac Queen, et al. Ferguson distributions via pólya urn schemes. The annals of statistics , 1(2):353–355, 1973.
- 3Gelman et al. [2013] Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis . Chapman and Hall/CRC, 2013.
- 4Horn et al. [1990] Roger A Horn, Roger A Horn, and Charles R Johnson. Matrix analysis . Cambridge university press, 1990.
- 5Murphy [2007] Kevin P Murphy. Conjugate bayesian analysis of the gaussian distribution. def , 1(2 σ 𝜎 \sigma 2):16, 2007.
- 6Rousseeuw [1987] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics , 20:53–65, 1987.
- 7Tibshirani et al. [2001] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 63(2):411–423, 2001.
