Jeffreys priors for mixture estimation: properties and alternatives

Clara Grazian; Christian P. Robert

arXiv:1706.02563·stat.ME·December 13, 2017·Comput. Stat. Data Anal.

TL;DR

This paper investigates Jeffreys priors for mixture models, highlighting their properties, challenges like impropriety, and proposing their use as default priors for mixture weights in overfitted models.

Contribution

It provides a detailed analysis of Jeffreys priors in mixture estimation, including their properties, limitations, and a novel application as default priors for mixture weights.

Findings

1. Jeffreys priors are often improper and lack closed-form expressions.

2. The posterior distributions with Jeffreys priors are mostly improper.

3. Jeffreys priors for mixture weights are conservative regarding the number of components.

Abstract

While Jeffreys priors usually are well-defined for the parameters of mixtures of distributions, they are not available in closed form. Furthermore, they often are improper priors. Hence, they have never been used to draw inference on the mixture parameters. The implementation and the properties of Jeffreys priors in several mixture settings are studied. It is shown that the associated posterior distributions most often are improper. Nevertheless, the Jeffreys prior for the mixture weights conditionally on the parameters of the mixture components will be shown to have the property of conservativeness with respect to the number of components in the case of overfitted mixtures, and it can therefore be used as a default prior in this context.

Tables

Table 1: Posterior means for the weights, the means and the standard deviations of a ten-component mixture model, assumed for the galaxy, the enzyme and the acidity datasets (the first number in brackets is the posterior mean and the second is the posterior standard deviation). The estimated location and scale parameters are not shown when the weights are concentrated around zero.

Dataset	galaxy	enzyme	acidity
$p_{1}$	0.437	0.606	0.601
	(23.139, 1.507)	(0.193, 0.090)	(4.356, 0.442)
$p_{2}$	0.390	0.343	0.378
	(19.790, 0.715)	(1.216, 0.348)	(6.294, 0.531)
$p_{3}$	0.080	0.021	0.003
	(9.709, 0.503)	(0.915, 1.174)	(0.083, 0.802)
$p_{4}$	0.056	0.018	0.003
	(32.630, 1.842)	(1.176, 0.702)	(0.125, 0.589)
$p_{5}$	0.037	0.000	0.000
	(16.138, 1.226)	-	-
$\sum_{\ell=6}^{10} p_{\ell}$	0.000	0.000	0.000

Table 2: Posterior means for the weights, the means and the standard deviations of a ten-component mixture model, assumed for the network dataset (credible intervals of level 0.95 in brackets).

Gaussian comp.
$p$	$\mu$	$\sigma$
0.214	224.318	50.271
(0.180, 0.249)	(222.657, 233.842)	(45.483, 55.265)
0.519	161.645	7.497
(0.474, 0.568)	(160.216, 161.882)	(6.830, 8.212)
0.221	82.847	1.888
(0.188, 0.257)	(81.057, 82.270)	(1.666, 2.135)
0.046	92.826	3.474
(0.029, 0.064)	(91.710, 93.700)	(2.698, 4.388)
$\sum_{\ell=5}^{10} p_{\ell} = 0.000$

Gumbel comp.
$p$	$\mu$	$\sigma$
0.214	213.512	59.080
(0.183, 0.251)	(213.446, 213.846)	(53.526, 64.667)
0.520	160.164	7.959
(0.479, 0.562)	(160.113, 160.482)	(7.465, 8.482)
0.265	83.260	3.348
(0.219, 0.302)	(83.251, 83.270)	(3.005, 3.753)
$\sum_{\ell=4}^{10} p_{\ell} = 0.000$

Equations

$$\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x\mid\theta_{\ell}), \qquad \sum_{\ell=1}^{k} p_{\ell} = 1$$

$$\pi^{J}(\theta) \propto |I(\theta)|^{1/2} = \left| -\mathbb{E}\left[\frac{\partial^{2}}{\partial\theta\,\partial\theta^{T}} \log g(X;\theta)\right] \right|^{1/2}$$

$$\frac{2\epsilon}{\sigma_{1}} f\left(\frac{x-\mu}{\sigma_{1}}\right) \mathbb{I}_{x<\mu} + \frac{2(1-\epsilon)}{\sigma_{2}} f\left(\frac{x-\mu}{\sigma_{2}}\right) \mathbb{I}_{x>\mu}$$

$$\pi(\mu,\sigma_{1},\sigma_{2}) \propto \frac{1}{\sigma_{1}\sigma_{2}\{\sigma_{1}+\sigma_{2}\}}$$

$$-\int_{\mathcal{X}} \frac{\partial^{2} \log\left[\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x\mid\theta_{\ell})\right]}{\partial\theta_{j}\,\partial\theta_{h}} \left[\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x\mid\theta_{\ell})\right] \mathrm{d}x$$

$$\int_{\mathcal{X}} \frac{(f_{j}(x)-f_{k}(x))(f_{h}(x)-f_{k}(x))}{\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x)}\, \mathrm{d}x$$

$$p_{k} = 1 - p_{1} - \dots - p_{k-1}$$

$$g(x,z\mid\theta,p) = \prod_{i=1}^{n}\prod_{\ell=1}^{k} \left[f_{\ell}(x_{i}\mid\theta_{\ell})\, p_{\ell}\right]^{\mathbb{I}_{[z_{i,\ell}=1]}} = \left[\prod_{\ell=1}^{k}\ \prod_{i:\,z_{i,\ell}=1} f_{\ell}(x_{i}\mid\theta_{\ell})\right] \left[\prod_{\ell=1}^{k} p_{\ell}^{n_{\ell}}\right]$$

$$-\mathbb{E}\left[\frac{\partial^{2}}{\partial p_{\ell}^{2}} \log g(x,z\mid\theta,p)\right]$$

$$-\mathbb{E}\left[\frac{\partial^{2}}{\partial p_{\ell}\,\partial p_{j}} \log g(x,z\mid\theta,p)\right]$$

$$\log g(x\mid\theta,p) = \log g(x,z\mid\theta,p) - \sum_{i=1}^{n}\sum_{\ell=1}^{k} \mathbb{I}_{[z_{i,\ell}=1]} \log p(z_{i,\ell}=1\mid x_{i},\theta,p)$$

$\det(I(p))$

$[\det(I(p))]^{1/2}$

$\pi_{J}(p)$

$$\int_{\mathcal{X}} \frac{(f_{j}(x)-f_{k}(x))(f_{h}(x)-f_{k}(x))}{f_{j}(x)}\, \mathrm{d}x$$

$$p_{1} f_{1}(x\mid\mu,\tau) + \sum_{\ell=2}^{k} p_{\ell} f_{\ell}\left(\frac{a_{\ell}+x}{b_{\ell}} \,\Big|\, \mu,\tau,a_{\ell},b_{\ell}\right)$$

$$p\,\mathcal{N}(\mu,\tau^{2}) + (1-p)\,\mathcal{N}(\mu+\tau\delta,\ \tau^{2}\sigma^{2})$$

$$\begin{aligned}
p\,\mathcal{N}(\mu,\tau^{2}) &+ \sum_{\ell=1}^{k-2} (1-p)(1-q_{1})\cdots(1-q_{\ell-1})\,q_{\ell}\; \mathcal{N}(\mu+\tau\theta_{1}+\dots+\tau\sigma_{1}\cdots\sigma_{\ell-1}\theta_{\ell},\ \tau^{2}\sigma_{1}^{2}\cdots\sigma_{\ell}^{2}) \\
&+ (1-p)(1-q_{1})\cdots(1-q_{k-2})\; \mathcal{N}(\mu+\tau\theta_{1}+\dots+\tau\sigma_{1}\cdots\sigma_{k-2}\theta_{k-1},\ \tau^{2}\sigma_{1}^{2}\cdots\sigma_{k-1}^{2})
\end{aligned}$$

$$p\,\mathcal{N}(\mu,\tau^{2}) + (1-p)\,q\,\mathcal{N}(\mu+\tau\theta,\ \tau^{2}\sigma_{1}^{2}) + (1-p)(1-q)\,\mathcal{N}(\mu+\tau\theta+\tau\sigma_{1}\epsilon,\ \tau^{2}\sigma_{1}^{2}\sigma_{2}^{2})$$

$$g(x\mid\theta) = \sum_{\ell=1}^{k} p_{\ell}\,\mathcal{N}(x\mid\mu_{\ell},\sigma_{\ell})$$

$\mu_{\ell}$

$\sigma_{\ell}$

$p\mid\mu,\sigma$

$$\pi(\mu_{0},\zeta_{0}) \propto \frac{1}{\zeta_{0}}$$

$$0.5\,\mathcal{N}(-3,1) + 0.5\,\mathcal{N}(3,1)$$

$$g_{X}(x) = \sum_{\ell=1}^{k} p_{\ell}\,\mathcal{N}(x\mid\mu_{\ell},\sigma_{\ell}^{2})$$

$$\mathbb{E}\left[-\frac{\partial^{2}\log g_{X}(X)}{\partial\mu_{i}^{2}}\right] = \frac{p_{i}^{2}}{\sigma_{i}^{4}} \int_{-\infty}^{\infty} \frac{\left[(x-\mu_{i})\,\mathcal{N}(x\mid\mu_{i},\sigma_{i}^{2})\right]^{2}}{\sum_{\ell=1}^{k} p_{\ell}\,\mathcal{N}(x\mid\mu_{\ell},\sigma_{\ell}^{2})}\, \mathrm{d}x$$

$$\mathbb{E}\left[-\frac{\partial^{2}\log g_{X}(X)}{\partial\mu_{i}\,\partial\mu_{j}}\right] = \frac{p_{i}p_{j}}{\sigma_{i}^{2}\sigma_{j}^{2}} \int_{-\infty}^{\infty} \frac{(x-\mu_{i})\,\mathcal{N}(x\mid\mu_{i},\sigma_{i}^{2})\,(x-\mu_{j})\,\mathcal{N}(x\mid\mu_{j},\sigma_{j}^{2})}{\sum_{\ell=1}^{k} p_{\ell}\,\mathcal{N}(x\mid\mu_{\ell},\sigma_{\ell}^{2})}\, \mathrm{d}x$$

$$\mathbb{E}\left[-\frac{\partial^{2}\log g_{X}(X)}{\partial\mu_{j}^{2}}\right] = \frac{p_{j}^{2}}{\sigma_{j}^{4}} \int_{-\infty}^{\infty} \frac{\left[(t-\mu_{j}+\mu_{i})\,\mathcal{N}(t\mid\mu_{j}-\mu_{i},\sigma_{j}^{2})\right]^{2}}{\sum_{\ell=1}^{k} p_{\ell}\,\mathcal{N}(t\mid\mu_{\ell}-\mu_{i},\sigma_{\ell}^{2})}\, \mathrm{d}t$$

$$\mathbb{E}\left[-\frac{\partial^{2}\log g_{X}(X)}{\partial\mu_{j}\,\partial\mu_{m}}\right] = \frac{p_{j}p_{m}}{\sigma_{j}^{2}\sigma_{m}^{2}} \cdot$$

Code & Models

Repositories

cgrazian/Jeffreys_mixtures (official)

Full text

Jeffreys priors for mixture estimation: properties and alternatives

Clara Grazian (corresponding author), Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Microbiology Department, Headley Way, Oxford, OX3 9DU, United Kingdom.

Christian P. Robert, CEREMADE, Université Paris-Dauphine, University of Warwick and CREST, Paris.

Abstract

While Jeffreys priors usually are well-defined for the parameters of mixtures of distributions, they are not available in closed form. Furthermore, they often are improper priors. Hence, they have never been used to draw inference on the mixture parameters. The implementation and the properties of Jeffreys priors in several mixture settings are studied. It is shown that the associated posterior distributions most often are improper. Nevertheless, the Jeffreys prior for the mixture weights conditionally on the parameters of the mixture components will be shown to have the property of conservativeness with respect to the number of components in the case of overfitted mixtures, and it can therefore be used as a default prior in this context.

Keywords: Noninformative prior, mixture of distributions, Bayesian analysis, Dirichlet prior, improper prior, improper posterior, label switching.

1 Introduction

Bayesian inference in mixtures of distributions has been studied quite extensively in the literature. See, e.g., McLachlan and Peel (2000) and Frühwirth-Schnatter (2006) for book-long references and Lee et al. (2009) for one among many surveys. From a Bayesian perspective, one of the several difficulties with this type of distribution,

$$\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x\mid\theta_{\ell}), \qquad \sum_{\ell=1}^{k} p_{\ell} = 1, \tag{1}$$

is that its ill-defined nature (non-identifiability, multimodality, unbounded likelihood, etc.) leads to restrictive prior modelling since most improper priors are not acceptable. This is due in particular to the feature that a sample from (1) may contain no observation from one of the $k$ components $f(\cdot|\theta_{\ell})$ (see, e.g., Titterington et al., 1985). Although the probability of such an event decreases quickly to zero as the sample size grows, it nonetheless prevents the use of independent improper priors, unless such events are prohibited (Diebolt and Robert, 1994). Similarly, the exchangeable nature of the components often induces both multimodality in the posterior distribution and convergence difficulties, as exemplified by the label switching phenomenon that is now quite well-documented (Celeux et al., 2000; Stephens, 2000; Jasra et al., 2005; Frühwirth-Schnatter, 2006; Geweke, 2007; Puolamäki and Kaski, 2009). This feature is characterized by a lack of symmetry in the outcome of a Markov chain Monte Carlo (MCMC) algorithm, in that the posterior density is exchangeable in the components of the mixture but the MCMC sample does not exhibit this symmetry. In addition, most MCMC samplers do not concentrate around a single mode of the posterior density, partly exploring several modes, which makes the construction of Bayes estimators of the components much harder.
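As a quick worked illustration of the empty-component event, assuming i.i.d. observations from (1) (the values $p_{\ell}=0.05$ and $n=50$ are assumptions chosen only for this sketch):

```latex
\mathbb{P}(\text{no observation from component } \ell) = (1-p_{\ell})^{n},
\qquad \text{e.g.} \quad (1-0.05)^{50} \approx 0.077 .
```

The probability vanishes geometrically in $n$, yet it is positive for every finite sample, which is what rules out independent improper priors on the $\theta_{\ell}$.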

When specifying a prior over the parameters of (1), it is therefore quite delicate to produce a manageable and sensible non-informative version, and some have argued against using non-informative priors in this setting (for example, McLachlan and Peel (2000) argue that it is impossible to obtain proper posterior distributions from fully noninformative priors), on the basis that mixture models are ill-defined objects that require informative priors to give a meaning to the notion of a component of (1). For instance, the distance between two components needs to be bounded from below to avoid repeating the same component indefinitely. Alternatively, the components all need to be informed by the data, as exemplified in Diebolt and Robert (1994), who imposed a completion scheme (i.e., a joint model on both parameters and latent variables) such that all components were allocated at least two observations, thereby ensuring that the (truncated) posterior was well-defined. Wasserman (2000) later proved that this truncation led to consistent estimators and moreover that only this type of prior could produce consistency. While the constraint on the allocations is not fully compatible with the i.i.d. representation of a mixture model, it naturally expresses a modelling requirement that all components have a meaning in terms of the data, namely that all components genuinely contributed to generating a part of the data. This translates as a form of weak prior information on how much one trusts the model and how meaningful each component is on its own (as opposed to the possibility of adding meaningless artificial extra components with almost zero weights or almost identical parameters).

While we do not seek Jeffreys priors as the ultimate prior modelling for non-informative settings, being altogether convinced of the lack of unique reference priors (Robert, 2001a; Robert et al., 2009), we think it is nonetheless worthwhile to study the performance of those priors in the setting of mixtures in order to determine whether they can provide a version of reference priors and whether they are at least well-defined in such settings. We will show that the Jeffreys prior provides reasonable inference only in very specific situations.

In Section 2 we provide a formal characterisation of properness of the posterior distribution for the parameters of a mixture model, in particular with Gaussian components, when a Jeffreys prior is used for them. In Section 3 we analyze the properness of the Jeffreys prior and of the related posterior distribution: the Jeffreys prior (and hence the related posterior) turns out to be proper only when the weights of the components (which are defined on a compact space) are the only unknown parameters; on the other hand, when the other parameters are unknown, the Jeffreys prior is proved to be improper and it provides a proper posterior distribution in only one situation. In Section 4 we present a way to carry out a noninformative analysis of mixture models; in particular, we propose to use the Jeffreys prior as a default prior in the case of overfitted mixtures and to introduce improper priors for at least some parameters. The default proposal of Section 4 is tested on several simulation studies in Section 5 and on several real examples in Section 6, using both well-known datasets from the mixture literature and a new dataset. Section 7 concludes the paper.

2 Jeffreys priors for mixture models

We recall that the Jeffreys prior was introduced by Jeffreys (1939) as a default prior based on the Fisher information matrix

$$\pi^{J}(\theta) \propto |I(\theta)|^{1/2} = \left| -\mathbb{E}\left[\frac{\partial^{2}}{\partial\theta\,\partial\theta^{T}} \log g(X;\theta)\right] \right|^{1/2},$$

whenever the latter is well-defined; $I(\cdot)$ stands for the expected Fisher information matrix and the symbol $|\cdot|$ denotes the determinant. Although the prior is endowed with some frequentist properties like matching and asymptotic minimal information (Robert, 2001a, Chapter 3), it does not constitute the ultimate answer to the selection of prior distributions in non-informative settings and there exist many alternatives such as reference priors (Berger et al., 2009), maximum entropy priors (Rissanen, 2012), matching priors (Ghosh et al., 1995), and other proposals (Kass and Wasserman, 1996). In most settings Jeffreys priors are improper, which may explain their conspicuous absence in the domain of mixture estimation, since the latter prohibits the use of independent improper priors by allowing any subset of components to go “empty” with positive probability. That is, the likelihood of a mixture model can always be decomposed as a sum over all possible partitions of the data into $k$ groups at most, where $k$ is the number of components of the mixture. This means that there are terms in this sum where no observation from the sample brings any amount of information about the parameters of a specific component.
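The decomposition invoked here can be written out explicitly. The allocation vector $z=(z_{1},\ldots,z_{n})$ below is notation introduced only for this sketch:

```latex
\prod_{i=1}^{n} \sum_{\ell=1}^{k} p_{\ell}\, f_{\ell}(x_{i}\mid\theta_{\ell})
  \;=\; \sum_{z\in\{1,\ldots,k\}^{n}} \; \prod_{i=1}^{n} p_{z_{i}}\, f_{z_{i}}(x_{i}\mid\theta_{z_{i}}) .
```

Of the $k^{n}$ terms on the right-hand side, the $(k-1)^{n}$ allocations with no $z_{i}$ equal to a given $\ell$ do not involve $\theta_{\ell}$ at all, so an improper prior on $\theta_{\ell}$ that is independent of the other parameters integrates to infinity on each of them.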

Approximations of the Jeffreys prior in the setting of mixtures can be found, e.g., in Figueiredo and Jain (2002), where the authors revert to independent Jeffreys priors on the components of the mixture. This induces the same negative side-effect as with other independent priors, namely an impossibility to handle improper priors. Rubio and Steel (2014) provide a closed-form expression for the Jeffreys prior for a location-scale mixture with two components. The family of distributions considered in Rubio and Steel (2014) is

$$\frac{2\epsilon}{\sigma_{1}} f\left(\frac{x-\mu}{\sigma_{1}}\right) \mathbb{I}_{x<\mu} + \frac{2(1-\epsilon)}{\sigma_{2}} f\left(\frac{x-\mu}{\sigma_{2}}\right) \mathbb{I}_{x>\mu}$$

(which thus hardly qualifies as a mixture, since the disjoint supports of the two components make it possible to identify which component each observation comes from). The factor $2$ in the fraction is due to the assumption of symmetry around zero for the density $f$. For this specific model, if we impose that the weight $\epsilon$ is a function of the variance parameters, $\epsilon=\nicefrac{{\sigma_{1}}}{{\sigma_{1}+\sigma_{2}}},$ the Jeffreys prior is given by

$$\pi(\mu,\sigma_{1},\sigma_{2}) \propto \frac{1}{\sigma_{1}\sigma_{2}\{\sigma_{1}+\sigma_{2}\}} .$$

However, in this setting, Rubio and Steel (2014) demonstrate that the posterior associated with the (regular) Jeffreys prior is improper, hence not relevant for conducting inference. Rubio and Steel (2014) also consider alternatives to the genuine Jeffreys prior, either by reducing the range or even the number of parameters, or by building a product of conditional priors. They further consider so-called non-objective priors that are only relevant to the specific case of the above mixture.

Another obvious explanation for the absence of Jeffreys priors is computational, namely that the Fisher information matrix is not available in closed form. The reason is that the generic $[j,h]$-th element, with $j,h\in\{1,\ldots,k\}$, of the Fisher information matrix for mixture models is an integral of the form

$$-\int_{\mathcal{X}} \frac{\partial^{2} \log\left[\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x\mid\theta_{\ell})\right]}{\partial\theta_{j}\,\partial\theta_{h}} \left[\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x\mid\theta_{\ell})\right] \mathrm{d}x$$

(in the special case of component densities with a univariate parameter), which cannot be computed analytically. Since these are unidimensional integrals, we derive an approximation of the elements of the Fisher information matrix based on Riemann sums. The resulting computational expense is of order $\mathrm{O}(b^{2})$ if $b$ is the total number of (independent) parameters. Since the elements of the information matrix usually are ratios between the component densities and the mixture density, there may be difficulties with non-probabilistic methods of integration.
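A minimal sketch of such a Riemann-sum approximation, restricted to the case where only the means of a Gaussian mixture are unknown. The function names, the integration range and the grid size are assumptions made for this illustration and are not taken from the authors' repository; the integrand is the outer product of the score numerators divided by the mixture density.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def fisher_information_means(weights, means, sds,
                             lower=-30.0, upper=30.0, n_grid=60001):
    """Riemann-sum approximation of the Fisher information matrix for the
    means of a Gaussian mixture (weights and standard deviations known)."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)

    x = np.linspace(lower, upper, n_grid)   # integration grid (assumed range)
    dx = x[1] - x[0]

    comp = normal_pdf(x[None, :], means[:, None], sds[:, None])  # shape (k, n_grid)
    mix = np.maximum(weights @ comp, 1e-300)                     # guard against 0/0

    # d g / d mu_j = p_j (x - mu_j) / sd_j^2 * N(x | mu_j, sd_j^2)
    dg = weights[:, None] * (x[None, :] - means[:, None]) / sds[:, None] ** 2 * comp

    # I[j, h] = integral of (dg_j * dg_h / g) dx, approximated by a Riemann sum
    return (dg / mix) @ dg.T * dx


def jeffreys_density_unnormalised(weights, means, sds):
    """Square root of the determinant of the approximated information matrix."""
    return np.sqrt(np.linalg.det(fisher_information_means(weights, means, sds)))
```

With a single component the matrix reduces to the familiar $1/\sigma^{2}$, which gives a simple sanity check of the grid; the cost grows with the square of the number of parameters, since every pair $(j,h)$ requires one sum.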

3 Characterization of the Jeffreys priors for mixture models and respective posteriors

Unsurprisingly, most Jeffreys priors associated with mixture models are improper, the exception being when only the weights of the mixture are unknown, as already demonstrated in Bernardo and Girón (1988).

We will characterize properness and improperness of Jeffreys priors and the derived posteriors when some or all of the parameters of distributions from location-scale families are unknown. These results are established analytically; the behavior of the Jeffreys prior and of the resulting posterior has also been studied through simulations, with sufficiently large Monte Carlo experiments (see Section 5). The following results are mostly presented for Gaussian mixture models; however, the behavior of the Jeffreys prior is common to all location-scale families, so the results may be generalized to any such family.

3.1 Weights of mixture unknown

A representation of the Jeffreys prior and the derived posterior distribution for the weights of a three-component mixture model is given in Figure 1: the prior distribution is much more concentrated around extreme values in the support, i.e., it is a prior distribution conservative in the number of important components.

Lemma 3.1.

When the weights $p_{i}$ are the only unknown parameters in (1), the corresponding Jeffreys prior is proper.

Proof.

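The generic element of the Fisher information matrix $I(p)$ of the mixture model (1) when the weights are the only unknown parameters is (for $j,h\in\{1,\ldots,k-1\}$)

$$\int_{\mathcal{X}} \frac{(f_{j}(x)-f_{k}(x))(f_{h}(x)-f_{k}(x))}{\sum_{\ell=1}^{k} p_{\ell} f_{\ell}(x)}\, \mathrm{d}x \tag{4}$$

when we consider the parametrization in $(p_{1},\ldots,p_{k-1})$, with

$$p_{k} = 1 - p_{1} - \dots - p_{k-1} .$$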

Consider now a data-augmented model, where a latent variable describing the allocation of each observation to a particular component is introduced. In other words, a latent variable $z_{i}$ is considered such that $z_{i}=(0\cdots 1\cdots 0)$, where $z_{i,\ell}=1$ in the $\ell$-th position of the vector if $x_{i}$ has been generated from the $\ell$-th component, for $i=1,\ldots,n$, where $n$ is the sample size, and $\ell=1,\ldots,k$. Therefore, $z=(z_{1},\ldots,z_{n})$ is a multinomial variable for $k$ possible outcomes such that

$$g(x,z\mid\theta,p) = \prod_{i=1}^{n}\prod_{\ell=1}^{k} \left[f_{\ell}(x_{i}\mid\theta_{\ell})\, p_{\ell}\right]^{\mathbb{I}_{[z_{i,\ell}=1]}} = \left[\prod_{\ell=1}^{k}\ \prod_{i:\,z_{i,\ell}=1} f_{\ell}(x_{i}\mid\theta_{\ell})\right] \left[\prod_{\ell=1}^{k} p_{\ell}^{n_{\ell}}\right] \tag{6}$$

where $\mathbb{I}_{[z_{i,\ell}=1]}$ is the indicator function that $z_{i,\ell}=1$ and $n_{\ell}$ is the number of allocations to the $\ell$-th component. For an extensive review of the techniques of data augmentation in the case of mixture models one may refer to Frühwirth-Schnatter (2006).

Equation (6) shows that the likelihood function is separable in $\theta$ and $p$ and that the second part is multinomial. Therefore, when looking for the Jeffreys prior for the weights of a complete (data-augmented) mixture model, the elements of the Fisher information matrix are

$$-\mathbb{E}\left[\frac{\partial^{2}}{\partial p_{\ell}^{2}} \log g(x,z\mid\theta,p)\right] \qquad\text{and}\qquad -\mathbb{E}\left[\frac{\partial^{2}}{\partial p_{\ell}\,\partial p_{j}} \log g(x,z\mid\theta,p)\right],$$

leading to the usual Jeffreys prior associated with the multinomial model, a Dirichlet distribution $\mathcal{D}ir(\frac{1}{2},\cdots,\frac{1}{2})$.
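For completeness, a short worked computation of the complete-data Fisher information for the weights. This is a sketch of the standard multinomial calculation under the parametrization $p_{k}=1-p_{1}-\dots-p_{k-1}$, using only $\mathbb{E}[n_{\ell}]=n\,p_{\ell}$; it is not a transcription of the paper's own display.

```latex
\begin{aligned}
\log g(x,z\mid\theta,p) &= \sum_{\ell=1}^{k} n_{\ell}\log p_{\ell} + \text{terms free of } p, \\
-\mathbb{E}\!\left[\frac{\partial^{2}}{\partial p_{\ell}^{2}}\log g\right]
  &= \frac{\mathbb{E}[n_{\ell}]}{p_{\ell}^{2}} + \frac{\mathbb{E}[n_{k}]}{p_{k}^{2}}
   = \frac{n}{p_{\ell}} + \frac{n}{p_{k}}, \\
-\mathbb{E}\!\left[\frac{\partial^{2}}{\partial p_{\ell}\,\partial p_{j}}\log g\right]
  &= \frac{\mathbb{E}[n_{k}]}{p_{k}^{2}} = \frac{n}{p_{k}} \qquad (\ell\neq j), \\
\det I^{\text{data-aug}}(p)
  &= n^{k-1}\prod_{\ell=1}^{k-1}\frac{1}{p_{\ell}}\left(1+\frac{1}{p_{k}}\sum_{\ell=1}^{k-1}p_{\ell}\right)
   = n^{k-1}\prod_{\ell=1}^{k} p_{\ell}^{-1}.
\end{aligned}
```

Taking the square root gives a density proportional to $\prod_{\ell} p_{\ell}^{-1/2}$, which is the kernel of the $\mathcal{D}ir(\frac{1}{2},\cdots,\frac{1}{2})$ distribution.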

The above only applies to the artificial case when the allocations $z_{i}$ are known. When they are unknown, it is easy to see that the log-likelihood function becomes

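$$\log g(x\mid\theta,p) = \log g(x,z\mid\theta,p) - \sum_{i=1}^{n}\sum_{\ell=1}^{k} \mathbb{I}_{[z_{i,\ell}=1]} \log p(z_{i,\ell}=1\mid x_{i},\theta,p)$$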

where the second term on the right side of the equation represents the loss of information compared to the data-augmented likelihood function. Define the expected Fisher information matrix for model (6) (when only the weights are unknown) as $I^{data-aug}(p,\theta)$ . Therefore, since the difference between both matrices is positive definite, this implies that

[TABLE]

This result shows that the Jeffreys prior on the weights of a mixture model when allocations are unknown is proper, since it is bounded by the Jeffreys prior $\mathcal{D}ir(\frac{1}{2},\cdots,\frac{1}{2})$ for the complete model.

As a particular case, when all the mixands converge to the same distribution, each of the elements of the form (4) tends to

$$\int_{\mathcal{X}} \frac{(f_{j}(x)-f_{k}(x))(f_{h}(x)-f_{k}(x))}{f_{j}(x)}\, \mathrm{d}x$$

which does not depend on $p$. Therefore, in this case, the determinant of the resulting Fisher information matrix is constant in $p=(p_{1},\cdots,p_{k})$ and the resulting Jeffreys prior is uniform on the $k$-dimensional simplex.

∎
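The bound used in the proof can be checked numerically in the simplest case $k=2$. The sketch below is an illustration under assumed settings (two unit-variance Gaussian components with means $-1$ and $1$, a fixed grid, per-observation information); it compares the square root of the Fisher information for the weight $p$ with the kernel of the $\mathcal{D}ir(\frac{1}{2},\frac{1}{2})$ prior, $[p(1-p)]^{-1/2}$.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def weight_information(p, f1, f2, dx):
    """Riemann-sum Fisher information for p in the mixture p*f1 + (1-p)*f2."""
    mix = np.maximum(p * f1 + (1.0 - p) * f2, 1e-300)
    return np.sum((f1 - f2) ** 2 / mix) * dx


# Assumed illustrative components: N(-1, 1) and N(1, 1).
x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]
f1 = normal_pdf(x, -1.0, 1.0)
f2 = normal_pdf(x, 1.0, 1.0)

p_grid = np.linspace(0.001, 0.999, 999)
jeffreys = np.sqrt(np.array([weight_information(p, f1, f2, dx) for p in p_grid]))

# Kernel of the complete-data Jeffreys prior, Dir(1/2, 1/2).
dirichlet_bound = 1.0 / np.sqrt(p_grid * (1.0 - p_grid))

# Crude approximation of the total mass of the unnormalised Jeffreys prior.
approx_mass = np.sum(jeffreys) * (p_grid[1] - p_grid[0])
```

Since the $\mathcal{D}ir(\frac{1}{2},\frac{1}{2})$ kernel integrates to $\pi$ on $(0,1)$, a value of `approx_mass` below $\pi$ is consistent with the properness claimed in Lemma 3.1 for this particular pair of components.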

We note that this result is a generalization to a $k$-component mixture of the prior derived in Bernardo and Girón (1988) for $k=2$ (however, these authors derive the reference prior for the limiting cases when all the components have pairwise disjoint supports and when all the components converge to the same distribution). This reasoning led Bernardo and Girón (1988) to conclude that the usual $\mathcal{D}(\lambda_{1},\ldots,\lambda_{k})$ Dirichlet prior with $\lambda_{\ell}\in[\nicefrac{{1}}{{2}},1]$ for all $\ell=1,\ldots,k$ seems to be a reasonable approximation. They also prove that the Jeffreys prior for the weights $p_{\ell}$ is convex, with an argument based on the sign of the second derivative.

It is important to stress that, in a mixture model setting, it is usual to saturate the model when the number of components is not known with certainty a priori and to consider a large number of components $k$. The main difficulty in this setting is non-identifiability; in particular, the rate of estimation for the saturated model is much slower than the standard $1/\sqrt{n}$. Rousseau and Mengersen (2011) have studied the effect of a prior distribution on the weights of a general mixture on regularizing the posterior distribution, i.e., consistency to a single configuration of the reduced parameter space. This is achievable with a prior that allows the extra components to be emptied or the existing ones to be merged. In particular, Rousseau and Mengersen (2011) propose a Dirichlet prior distribution, with parameters $\lambda_{1},\cdots,\lambda_{k}$ smaller than $r/2$ (where $r$ is the dimension of $\theta_{\ell}$) to empty the extra components or larger than $r/2$ to merge the extra components. However, the choice of $\lambda_{j}\,(j=1,\cdots,k)$ is quite influential for finite sample sizes. The configuration studied in the proof of Lemma 3.1 is compatible with the Dirichlet configuration of the prior proposed by Rousseau and Mengersen (2011). This is an important property of the Jeffreys prior, since it makes the prior conservative in the number of components. Namely, one can asymptotically identify the components that are artificially added to the model but have no meaning for the data. Moreover, it offers an automatic choice, in contrast to the Dirichlet prior, whose hyperparameters have to be chosen.
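To give a feel for how the Dirichlet hyperparameter controls the emptying of components, here is a small prior-only simulation. It is an illustrative sketch, not the asymptotic result of Rousseau and Mengersen (2011), which concerns the posterior; the number of components, the threshold `0.01` for calling a component "active" and the three values of $\lambda$ are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_active_components(lam, k=10, n_draws=5000, threshold=0.01):
    """Average number of weights above `threshold` under a symmetric
    Dirichlet(lam, ..., lam) prior on k mixture weights."""
    draws = rng.dirichlet(np.full(k, lam), size=n_draws)  # shape (n_draws, k)
    return float(np.mean(np.sum(draws > threshold, axis=1)))


sparse = mean_active_components(0.1)    # small lambda: most weights near zero
half = mean_active_components(0.5)      # Dir(1/2, ..., 1/2), the complete-data Jeffreys prior
diffuse = mean_active_components(5.0)   # large lambda: all weights away from zero
```

For Gaussian components with unknown mean and standard deviation one has $r=2$, so $\lambda=\nicefrac{1}{2}$ lies on the "emptying" side of the $r/2$ threshold discussed above.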

The shape of the Jeffreys prior for the weights of a mixture model depends on the type of the components: see Appendix A of the Supplementary Material for a discussion. The marginal Jeffreys prior for the weight of one component is more concentrated around one if that component is more informative in terms of Fisher information matrix: for example, if we consider a two-component mixture model with a Gaussian and a Student t component, the Jeffreys prior for the weights will be more symmetric as the number of degrees of freedom of the Student t increases.

3.2 Weights, location and scale parameters of a mixture model unknown

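In this section we consider mixtures of location-scale distributions. If the components of the mixture model (1) are distributions from a location-scale family and the location or scale parameters of the mixture components are unknown, this turns the mixture itself into a location-scale model:

$$p_{1} f_{1}(x\mid\mu,\tau) + \sum_{\ell=2}^{k} p_{\ell} f_{\ell}\left(\frac{a_{\ell}+x}{b_{\ell}} \,\Big|\, \mu,\tau,a_{\ell},b_{\ell}\right) .$$

As a result, model (1) may be reparametrized following Mengersen and Robert (1996), in the case of Gaussian components, as

$$p\,\mathcal{N}(\mu,\tau^{2}) + (1-p)\,\mathcal{N}(\mu+\tau\delta,\ \tau^{2}\sigma^{2}), \tag{8}$$

namely using a reference location $\mu$ and a reference scale $\tau$ (which may be, for instance, the location and scale of a specific component). Equation (8) may be generalized to the case of $k$ components as

$$\begin{aligned}
p\,\mathcal{N}(\mu,\tau^{2}) &+ \sum_{\ell=1}^{k-2} (1-p)(1-q_{1})\cdots(1-q_{\ell-1})\,q_{\ell}\; \mathcal{N}(\mu+\tau\theta_{1}+\dots+\tau\sigma_{1}\cdots\sigma_{\ell-1}\theta_{\ell},\ \tau^{2}\sigma_{1}^{2}\cdots\sigma_{\ell}^{2}) \\
&+ (1-p)(1-q_{1})\cdots(1-q_{k-2})\; \mathcal{N}(\mu+\tau\theta_{1}+\dots+\tau\sigma_{1}\cdots\sigma_{k-2}\theta_{k-1},\ \tau^{2}\sigma_{1}^{2}\cdots\sigma_{k-1}^{2}) .
\end{aligned}$$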
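The location-scale structure of the two-component reparametrization $p\,\mathcal{N}(\mu,\tau^{2})+(1-p)\,\mathcal{N}(\mu+\tau\delta,\tau^{2}\sigma^{2})$ can be verified numerically. The sketch below uses assumed parameter values and a hypothetical helper `reparametrised_mixture`; it checks that the density at $x$ equals $\tau^{-1}$ times the density of the standardised mixture (reference location 0, reference scale 1) at $(x-\mu)/\tau$.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def reparametrised_mixture(x, p, mu, tau, delta, sigma):
    """Two-component Gaussian mixture written relative to a reference
    location mu and reference scale tau."""
    return (p * normal_pdf(x, mu, tau)
            + (1.0 - p) * normal_pdf(x, mu + tau * delta, tau * sigma))


# Assumed illustrative values.
x = np.linspace(-5.0, 15.0, 201)
p, mu, tau, delta, sigma = 0.3, 2.0, 1.5, 2.0, 0.5

direct = reparametrised_mixture(x, p, mu, tau, delta, sigma)

# Same density obtained from the standardised mixture.
standardised = reparametrised_mixture((x - mu) / tau, p, 0.0, 1.0, delta, sigma) / tau

# Explicit standard parametrization: second component has mean 5.0 and sd 0.75.
explicit = 0.3 * normal_pdf(x, 2.0, 1.5) + 0.7 * normal_pdf(x, 5.0, 0.75)
```

Because $(\mu,\tau)$ act as a global location and scale, the remaining parameters $(p,\delta,\sigma)$ describe the shape of the mixture only.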

Since the mixture model is a location-scale model, the Jeffreys prior is as in the following Lemma (see also Robert, 2001a, Chapter 3).

Lemma 3.2.

When the parameters of a location-scale mixture model are unknown, the Jeffreys prior is improper, constant in $\mu$ and behaves as the power $\tau^{-d/2}$ of $\tau$, where $d$ is the total number of unknown parameters of the components (i.e. excluding the weights).

A new version of the proof, never presented before, is available in Appendix B of the Supplementary Material, while the characterization of the Jeffreys prior for $\delta$ is given in Appendix C.

We now derive analytical characterizations of the posterior distributions associated with the Jeffreys priors for mixture models.

Consider, first, the case where only the location parameters of a mixture model are unknown.

There is a substantial difference between the cases where $k=2$ or $k>2$ .

Lemma 3.3.

When $k=2$ , the posterior distribution derived from the Jeffreys prior when only the location parameters of model (14) are unknown is proper.

The complete proof of Lemma 3.3 is given in Appendix D of the Supplementary Material. Here it is worth noticing that the properness of the posterior distribution in the context of Lemma 3.3 depends on the representation of the mixture model as a location-scale distribution, where the second component is defined with respect to a reference component: if we focus on the part of the likelihood depending only on the second component, even if the prior is constant with respect to the difference between the location parameters $\delta$ as $\delta\rightarrow\pm\infty$, the likelihood depends on $\delta$ as $\exp(-\frac{n-1}{2}\delta^{2})$ and therefore the integral defining the posterior distribution converges.

Figure 2 shows an approximation of the Jeffreys prior for the location parameters of a two-component Gaussian mixture model on a grid of values and confirms that the prior is constant on the difference between the means and takes higher and higher values as the difference between them increases, while the posterior distribution, even if showing the classical multimodal nature (Celeux et al., 2000), seems to concentrate around the true modes. It also appears to be perfectly symmetric because the other parameters (weights and standard deviations) have been fixed as identical.

The same proof cannot be extended to the general case of $k$ components, because the location parameters are defined as several distances from the reference location parameter: if we again focus on the part of the likelihood depending on the second component, the integral with respect to $\delta_{2}$ converges; however, the prior is constant with respect to any other $\delta_{j}$ ($j=3,\ldots,k$) as $\delta_{j}\rightarrow\pm\infty$, so the integral does not converge with respect to the other differences. The following Lemma then holds (the formal proof is available in Appendix E).

Lemma 3.4.

When $k>2$ , the posterior distribution derived from the Jeffreys prior is improper when only the location parameters of model (14) are unknown.

This result confirms the idea that each part of the likelihood gives information about at most the difference between the locations of the respective components and the reference location, but not on the locations of the other components.

We can now consider the case where all the parameters of (14) are unknown.

Theorem 3.1.

The posterior distribution of the parameters of a mixture model with location-scale components derived from the Jeffreys prior when all parameters of model (14) are unknown is improper.

The proof is available in Appendix F of the Supplementary Material.

4 A noninformative alternative to Jeffreys prior

The information brought by the Jeffreys prior, or lack thereof, does not seem to be enough to conduct inference in the case of mixture models. The computation of the determinant creates a dependence between the elements of the Fisher information matrix in the definition of the prior distribution, which makes it difficult to find and justify moderate modifications of this prior that would lead to a proper posterior distribution. For example, using a proper prior for part of the scale parameters and the Jeffreys prior conditionally on them does not avoid impropriety, as shown in Appendix G of the Supplementary Material.

The literature covers attempts to define priors that add a small amount of information that is sufficient to conduct the statistical analysis without overwhelming the information contained in the data. Some of these are related to the computational issues in estimating the parameters of mixture models, as in the approach of Casella et al. (2002), who find a way to use a perfect slice sampler by focusing on components in the exponential family and conjugate priors. A characteristic example is given by Richardson and Green (1997), who propose weakly informative priors, which are data-dependent (or empirical Bayes) and are represented by flat normal priors over an interval corresponding to the range of the data. Nevertheless, since mixture models belong to the class of ill-posed problems, the influence of a proper prior over the resulting inference is difficult to assess.

Another solution found in Mengersen and Robert (1996) proceeds through the reparametrization (8) and introduces a reference component that allows for improper priors. This approach then envisions the other parameters as departures from the reference and ties them together by considering each parameter $\theta_{\ell}$ as a perturbation of the parameter of the previous component $\theta_{\ell-1}$ . This perspective is justified by the argument that the $(\ell-1)$ -th component may not be informative enough to absorb all the variability in the data. For instance, a three-component mixture model gets rewritten as

[TABLE]

where one can impose the constraint $1\geq\sigma_{1}\geq\sigma_{2}$ for identifiability reasons. Under this representation, it is possible to use an improper prior on the global location-scale parameter $(\mu,\tau)$ , while proper priors must be applied to the remaining parameters. This reparametrization has also been used for exponential components by Gruet et al. (1999) and for Poisson components by Robert and Titterington (1998). Moreover, Roeder and Wasserman (1997) propose a Markov prior which follows the same reasoning of dependence between the parameters for Gaussian components, where each parameter is again a perturbation of the parameter of the previous component $\theta_{\ell-1}$ . Kamary et al. (2017) also propose a reparametrization of location-scale mixtures based on invariance that allows for weakly informative priors.

On the one hand, this representation suggests defining a global location-scale parameter in a more implicit way, via a hierarchical model that considers more levels in the analysis and chooses noninformative priors at the last level of the hierarchy.

On the other hand, we believe that an essential feature of a default prior is that it should let the analysis be able to identify the correct number of meaningful components, in particular in the standard case where an overfitted mixture is assumed because the a priori information on the number of components is weak.

We thus propose a prior scenario which combines both the hierarchical representation and the conservativeness property in terms of components.

More precisely, consider the Gaussian mixture model (1)

[TABLE]

The parameters of each component may be considered as related in some way; for example, the observations induce a reasonable range, which makes it highly improbable to face very different means in the above Gaussian mixture model. A similar argument may be used for the standard deviations.

Therefore, at the second level of the hierarchical model, we may write

[TABLE]

which indicates that the location parameters vary between components, but are likely to be close, and that the scale parameters may be smaller or larger than $\zeta_{0}$ . We have decided to define both $\mu_{\ell}$ and $\sigma_{\ell}$ as depending on the hyperparameter $\zeta_{0}$ without loss of generality, as one may notice by analysing the mean and variance of the random variables; this representation allows the application of the MCMC scheme proposed in Robert and Mengersen (1999), which leads to a better mixing of the chains. The mixture weights are given the prior distribution $\pi^{J}(p|\mu,\sigma)$ , which is the Jeffreys prior for the weights, conditional on the location and scale parameters, given in Section 3.1; this choice makes use of the conservative property of the Jeffreys prior for the weights, which is essential in the case of misspecification of the number of components.

At the third level of the hierarchical model, the prior may be noninformative:

[TABLE]

As in Mengersen and Robert (1996) the parameters in the mixture model are considered tied together; on the other hand, this feature is not obtained via a constrained representation of the mixture model itself, but via a hierarchy in the definition of the model and the parameters.

Theorem 4.1.

The posterior distribution derived from the hierarchical representation of the Gaussian mixture model associated with (9), (10) and (11) is proper.

The proof of Theorem 4.1 is available in Appendix H of the Supplementary Material.

As a side remark, even if Theorem 4.1 is stated for Gaussian mixture models, it may be extended to other location-scale distributions. Section 6 will present an example with log-normal components and Section 6.1 one with Gumbel components. However, it cannot be generalized to an arbitrary location-scale distribution.

This hierarchical version of the mixture model presents some advantages; in particular, the Jeffreys prior used for the weights is conservative in terms of the number of components in the case of misspecification. We recall that when the number of components is not known, it is usual in practice to fix a model with a high number of components (if one wants to avoid a nonparametric analysis); it is therefore essential that the posterior distribution gives hints on the right $k$ . This feature of the Jeffreys prior allows the experimenter to do so in a noninformative way. More precisely, this hierarchical prior respects Assumption 5 of Rousseau and Mengersen (2011).

5 Simulation Study

In this Section we present the results of several simulation studies conducted to support the theoretical results presented so far. The results of additional simulations are given in Appendices G and I of the Supplementary Material.

As a remark, integrals of the form (3) need to be approximated, as mentioned in Section 2, which raises numerical issues. We decided to use Riemann sums (with $550$ points) when the component standard deviations are sufficiently large, as they produce stable results, and Monte Carlo integration (with sample sizes of $1500$ ) when they are small. In the latter case, the variability of MCMC results seems to decrease as $\sigma_{i}$ approaches [math]. See the Supplementary Material for a detailed description of these computational issues.
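The trade-off between the two approximations can be illustrated on a simple integral of this kind. The following is a minimal sketch, not the authors' implementation: it assumes a two-component Gaussian mixture and approximates the Fisher information of the weight $p$, $\int (f_{1}(x)-f_{2}(x))^{2}/(pf_{1}(x)+(1-p)f_{2}(x))\,dx$, by a midpoint Riemann sum with $550$ points and by Monte Carlo with $1500$ draws from the mixture. The function names, the integration bounds (eight standard deviations around each mean) and the random seed are illustrative assumptions.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fisher_weight_riemann(p, mu1, s1, mu2, s2, n_points=550, width=8.0):
    """Midpoint Riemann sum for the Fisher information of the weight p."""
    # Assumed bounds: cover `width` standard deviations around both means.
    lo = min(mu1 - width * s1, mu2 - width * s2)
    hi = max(mu1 + width * s1, mu2 + width * s2)
    h = (hi - lo) / n_points
    total = 0.0
    for i in range(n_points):
        x = lo + (i + 0.5) * h
        f1 = normal_pdf(x, mu1, s1)
        f2 = normal_pdf(x, mu2, s2)
        total += (f1 - f2) ** 2 / (p * f1 + (1.0 - p) * f2)
    return h * total


def fisher_weight_mc(p, mu1, s1, mu2, s2, n_draws=1500, seed=0):
    """Monte Carlo estimate: average of ((f1 - f2) / mixture)^2 under the mixture."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # Draw x from the mixture p * N(mu1, s1^2) + (1 - p) * N(mu2, s2^2).
        x = rng.gauss(mu1, s1) if rng.random() < p else rng.gauss(mu2, s2)
        f1 = normal_pdf(x, mu1, s1)
        f2 = normal_pdf(x, mu2, s2)
        total += ((f1 - f2) / (p * f1 + (1.0 - p) * f2)) ** 2
    return total / n_draws


riemann_value = fisher_weight_riemann(0.5, 0.0, 1.0, 3.0, 1.0)
mc_value = fisher_weight_mc(0.5, 0.0, 1.0, 3.0, 1.0)
print(riemann_value, mc_value)
```

In this sketch the Riemann sum is deterministic but depends on the chosen bounds, whereas the Monte Carlo estimate needs no bounds and its variability is governed by the number of draws.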

We can analyse the property of conservativeness in overfitted mixtures through simulations, by using the hierarchical prior proposed in Section 4. We consider a very simple example to illustrate this theoretical result. Suppose we want to fit a two-component Gaussian mixture model with weights $p$ and $1-p$ and unknown parameters to a sample of data $\mathbf{x}=\{x_{1},\cdots,x_{n}\}$ generated from a standard normal distribution $\mathcal{N}(0,1)$ . We computed the posterior distribution for $M=20$ replications of samples of size $n=(50,100,500,1000,5000,10000)$ . Figure 3 shows that the posterior mean of $p$ increases towards $1$ as $n$ increases.
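A much simplified version of this experiment can be sketched as follows. Unlike the study above, which places the hierarchical prior on all parameters and relies on MCMC, the sketch assumes that the two components are known, $\mathcal{N}(0,1)$ and $\mathcal{N}(1,1)$ (the second mean is an arbitrary illustrative choice), so that the posterior of $p$ under its Jeffreys prior can be evaluated on a grid. All names, the grid size, the sample sizes and the seed are assumptions made for illustration.

```python
import math
import random

MU1, MU2 = 0.0, 1.0  # assumed known component means, unit variances


def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def jeffreys_weight_prior(p, n_points=550):
    """Unnormalised Jeffreys prior of p: square root of its Fisher information."""
    lo, hi = min(MU1, MU2) - 8.0, max(MU1, MU2) + 8.0
    h = (hi - lo) / n_points
    total = 0.0
    for i in range(n_points):
        x = lo + (i + 0.5) * h
        f1, f2 = normal_pdf(x, MU1), normal_pdf(x, MU2)
        total += (f1 - f2) ** 2 / (p * f1 + (1.0 - p) * f2)
    return math.sqrt(h * total)


def posterior_mean_weight(data, grid, log_prior):
    """Posterior mean of p computed on a grid (known components)."""
    dens = [(normal_pdf(x, MU1), normal_pdf(x, MU2)) for x in data]
    log_post = []
    for p, lp in zip(grid, log_prior):
        loglik = sum(math.log(p * f1 + (1.0 - p) * f2) for f1, f2 in dens)
        log_post.append(loglik + lp)
    top = max(log_post)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - top) for v in log_post]
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)


grid = [(j + 0.5) / 400 for j in range(400)]
log_prior = [math.log(jeffreys_weight_prior(p)) for p in grid]

rng = random.Random(1)
posterior_means = {}
for n in (50, 500, 2000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # data from N(0, 1) only
    posterior_means[n] = posterior_mean_weight(sample, grid, log_prior)
print(posterior_means)
```

Since the data come from the first component alone, the posterior mean of $p$ is expected to move towards $1$ as $n$ grows, which mirrors, under these simplifying assumptions, the behaviour reported for the full model.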

We have also considered a more complicated situation, where we want to fit a model with an increasing number of components ( $k=(2,3,4,5)$ ) to a data set $\mathbf{x}=\{x_{1},\cdots,x_{n}\}$ generated from a two-component mixture model

[TABLE]

Figures 4 and 5 show the boxplots for the posterior means of the weights obtained through $M=20$ replications of the experiment, with a correct ( $k=2$ ) or a misspecified ( $k=(3,4,5)$ ) model. It is clear that, for models with additional components, the weights of these components are estimated by smaller and smaller values as the sample size increases. The variability of the estimates (in repetitions of the experiment) is smaller when the exact number of components is assumed; however, in every case, the Bayesian analysis based on the Jeffreys prior is able to identify the right number of components. The higher variability in estimating the weights is reflected in the fact that, as the number of components increases, the estimated (and the predictive) densities are less and less smooth. Nevertheless, this feature is mitigated as the sample size increases; see Appendix I in the Supplementary Material.

6 Illustrations

In this Section we will analyse the performance of the approach proposed in Section 4 in three datasets so well-known in the literature of mixture models that they can be taken as benchmarks and in a new dataset we propose here for the first time. In order to better present this new dataset, the analysis of it is presented separately.

The first dataset contains data about the velocity (in km per second) of 82 galaxies in the Corona Borealis region. The goal of this analysis is to understand the number of stellar populations, in order to support a particular theory of the formation of the Galaxy. The Galaxy dataset has been investigated by several authors, including Richardson and Green (1997), Raftery (1996), Escobar and West (1995) and Roeder (1990) among others.

The galaxy velocities are considered as random variables distributed according to a mixture of $k$ normal distributions. The evaluation of the number of components has proved to be delicate, with estimates from 3 in Roeder and Wasserman (1997) to 5 in Richardson and Green (1997) and 7 in Escobar and West (1995).

We have assumed a ten-component mixture model and checked whether or not the hierarchical approach that uses the conditional Jeffreys prior on the weights of the mixture model manages to identify a smaller number of significant components. The results are available in Figure 6 and Table 1. The algorithm identifies 5 components with weights larger than zero, a result in line with Richardson and Green (1997) and more conservative than Escobar and West (1995), which confirms the Jeffreys prior’s feature of being conservative in the number of components. Credible intervals also show that the parameters of the components with marginal posterior distributions for the weights not concentrated around zero are estimated with lower uncertainty.

The second dataset is related to a population study to validate caffeine as a probe drug to establish the genetic status of rapid acetylators and slow acetylators (Bechtel et al., 1993): many drugs, including caffeine, are metabolized by a polymorphic enzyme (EC 2.3.1.5) in humans and the white population is divided into two groups of slow acetylators and rapid acetylators. Caffeine is considered an interesting drug to study the phenotype of people, because it is regularly consumed by a large part of the population. Several population studies have been conducted, some of them reporting a bimodality, some others a trimodality. We focus on the study presented by Bechtel et al. (1993), involving 245 unrelated patients and computing the molar ratio between two metabolites of caffeine, AFMU and 1X, both measured in urine 4 to 6 hours after ingestion of 200 mg of caffeine.

We have again assumed a ten-component mixture model and checked whether or not the hierarchical approach which uses the conditional Jeffreys prior on the weights of the mixture model is able to identify a smaller number of significant components.

The results are available in Figure 7 and Table 1. The algorithm identifies two components with weights clearly larger than zero and two other components with very small weights. Bechtel et al. (1993) identify a bimodal density, while Richardson and Green (1997) consider a 3-5 component mixture highly likely. The Jeffreys prior allows the analysis to concentrate on mainly two subgroups and suggests that Gaussian components may be inappropriate in this setting: looking at the location of the components with small weights, it may be more adequate to consider asymmetric distributions.

Our third dataset is related to measuring the acid neutralizing capacity (ANC) (in log-scale) of a sample of 155 lakes in north-central Wisconsin, to determine the number of lakes that have been affected by acidic deposition (Crawford et al., 1992): the ANC measures the capability of a lake to neutralize acid, i.e. low values may indicate a problem for the lake’s biological diversity.

The results are available in Figure 8 and Table 1. The algorithm identifies two components with significant weights and two other components with very small weights. Crawford et al. (1992) assume a bimodal density, while Richardson and Green (1997) consider a 3-5 component model highly likely. The Jeffreys prior again allows the analysis to concentrate on two main subgroups and suggests investigating the importance of the other two components and possibly the goodness-of-fit of the log-normal distribution in this setting.

6.1 Network dataset

A recent trend in computer network systems is the deployment of network functions in software (Nunes et al., 2014). The so-called “software dataplanes” are emerging as an alternative to traditional hardware switches and routers, reducing costs and enhancing programmability.

The monitoring of IP packets is, among all possible network functions, one of the most suitable for a software deployment. However, the monitoring has a huge cost in terms of CPU (processing) time consumed per packet. The main reason for this is that each incoming packet triggers the retrieval, from a large hash-table, of all the information related to the packet flow (i.e. the packet’s family). This operation is generally called flow-entry retrieval. The time required for the flow-entry retrieval (retrieval time) mainly depends on whether such information is available in one of the processor caches (e.g. L1, L2, L3) or in memory.

The dataset used in this analysis consists of generated samples of retrieval times, each containing $10^{6}$ observations, under two different set-ups. In the first one, the flow-entry has been forced to reside in fast processor caches (“hit”). In the second one, all flow-entries have been forced to reside in the server RAM (memory), which results in a slower flow-entry retrieval (“miss”).

Both samples show a heavy tail, due to possible hash collisions on the table, as well as additional delays introduced by measuring the retrieval time at a nanosecond timescale. In the case of “miss”, another reason for the heavy tail can be identified with the virtual/physical memory mapping, which can inflate the retrieval time in some cases.

The goal of a realistic analysis is to infer the proportion of reported times which may be considered from the “hit” distribution and the proportion of times which may be considered from the “miss” distribution, i.e. to derive what is the percentage of packets for which the flow-entry was in the cache and the percentage of packets for which the flow-entry was in memory.

However, a first simulation is generally used to test the procedure. The interest of the analysis lies in the region of the space where the two distributions overlap, not in the external tails, which may nonetheless affect inference. Therefore, a preliminary analysis may be conducted in order to understand whether part of the future observations may be discarded from the analysis. In this particular case, the conservative property of the Jeffreys prior may be used to understand how important the tails of each distribution are and to identify the right models to use. For instance, a comparison between a Gaussian mixture model and a mixture model with Gumbel components may be run: if in both cases the analysis run with a Jeffreys prior for the mixture weights identifies more than the two (assumed) distributions of interest, this may suggest that the observations allocated to the external components (not the “hit” or the “miss” ones) may be discarded, while also providing inference on the proportion of observations to discard.

Figure 9 and Table 2 show the results of this analysis: adopting a Jeffreys prior for the mixture weights when assuming Gumbel components gives a better estimate of the first component and describes the asymmetry observed in the data as an asymmetry in the first component instead of an additional component. Nevertheless, it is not sufficient to identify the observations in the right tail of the second component as part of its tail, since the algorithm identifies a third component located in that part of the space.

In this setting, the Jeffreys prior makes it possible to i) identify a misspecification of the model assumptions (the approximated Bayes factor of the mixture of Gumbel components against the mixture of normal components is $2.10$ ) and ii) identify which part of the observations to discard from further studies.

7 Conclusion

This thorough analysis of the Jeffreys priors in the setting of mixtures with location-scale components shows that mixture distributions deserve the qualification of an ill-posed problem with regard to the production of non-informative priors. Indeed, we have shown that most configurations for Bayesian inference in this framework do not allow for the standard Jeffreys prior to be taken as a reference. While this is not the first occurrence where Jeffreys priors cannot be used as reference priors, we have shown that the Jeffreys prior for the mixture weights has the important property of being conservative in the number of components, with a configuration compatible with the results of Rousseau and Mengersen (2011). This is a general feature of the Jeffreys prior for the mixture weights, which is independent of the shape of the distributions composing the mixture.

Nevertheless, we have decided to study its behavior in the specific case of components from location-scale families. We have proposed a hierarchical representation of the mixture model, which allows for improper priors at the highest level of the hierarchy and assumes the Jeffreys prior for the mixture weights in the second level, conditional on prior distributions for the location and scale parameters along the lines of Mengersen and Robert (1996).

Through several examples, both on simulated and real datasets, we have shown that this representation seems to be more conservative on the number of components than other non-informative or weakly informative prior distributions for mixture models available in the literature. In particular, it seems to be able to recognize the meaningful components, which is an essential property for a noninformative prior for mixture models: in an objective setting, it is essential to consider the possibility of having assumed a wrong number of components. In this sense, the Jeffreys prior for the mixture weights may be used to identify the meaningful components and possible misspecifications of either the number or the distributional family of the components.

As an aside, we have mainly analyzed mixtures of Gaussian distributions in this paper, with extensions of the theoretical results to the other distributions of the location-scale family. Nevertheless, the possible difficulties deriving from the use of distributions different from the Gaussian are not considered here and will be the focus of future research. In particular, poorly specified likelihoods and ill-behaved cases are more likely to raise difficulties. However, the Jeffreys prior is known as a regularization prior that does not necessarily reflect prior beliefs, but in combination with the likelihood function yields posteriors with desirable properties; see Hoogerheide and van Dijk (2008) for a detailed review of ill-behaved posterior cases and the role of the Jeffreys prior in those cases.

Acknowledgements and Notes

The code used for the Gaussian mixture models is available online at the following link: https://github.com/cgrazian/Jeffreys_mixtures.

The Authors want to thank Gioacchino Tangari, from the Department of Electronic and Electrical Engineering, University College London, for having provided the simulations of Section 6.1.

Supplementary Material

Appendix A: Form of the Jeffreys prior for the weights of the mixture model.

The shape of the Jeffreys prior for the weights of a mixture model depends on the type of the components. Figures 10, 11 and 12 show the form of the Jeffreys prior for a two-component mixture model for different choices of components. It is always concentrated around the extreme values of the support; however, the amount of concentration around $0$ or $1$ depends on the information brought by each component. In particular, Figure 10 shows that the prior is much more symmetric when there is symmetry between the variances of the distribution components, while Figure 11 shows that the prior is much more concentrated around 1 for the weight relative to the normal component if the second component is a Student t distribution.

Finally Figure 12 shows the behavior of the Jeffreys prior when the first component is Gaussian and the second is a Student t and the number of degrees of freedom is increasing. As expected, as the Student t is approaching a normal distribution, the Jeffreys prior becomes more and more symmetric.

Appendix B: Proof of Lemma 3.2

*When the parameters of a location-scale mixture model are unknown, the Jeffreys prior is improper, constant in $\mu$ and proportional to $\tau^{-d/2}$ , where $d$ is the total number of unknown parameters of the components (i.e. excluding the weights).*

Proof.

We first consider the case where the means are the only unknown parameters of a Gaussian mixture model

[TABLE]

The generic elements of the expected Fisher information matrix are, in the case of diagonal and off-diagonal terms respectively:

[TABLE]

Now, consider the change of variable $t=x-\mu_{i}$ in the above integrals, where $\mu_{i}$ is thus the mean of the $i$ -th Gaussian component ( $i\in\{1,\cdots,k\}$ ). The above integrals are then equal to

[TABLE]

Therefore, the terms in the Fisher information only depend on the differences $\delta_{j}=\mu_{i}-\mu_{j}$ for $j\in\{1,\cdots,k\}$ . This implies that the Jeffreys prior is improper since a reparametrization in ( $\mu_{i},\mathbf{\delta}$ ) shows the prior does not depend on $\mu_{i}$ .
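This invariance can be checked numerically. The sketch below is an illustration under stated assumptions rather than the computation used in the paper: it takes a two-component Gaussian mixture with known weight $p=0.3$ and unit variances, approximates the Fisher information matrix of $(\mu_{1},\mu_{2})$ by a midpoint Riemann sum, and compares the result before and after shifting both means by the same constant. The function name, the weight, the grid and the shift are hypothetical choices.

```python
import math


def fisher_information_means(mu1, mu2, p=0.3, n_points=4000):
    """2x2 Fisher information of (mu1, mu2), unit variances, known weight p."""
    lo, hi = min(mu1, mu2) - 10.0, max(mu1, mu2) + 10.0
    h = (hi - lo) / n_points
    info = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(n_points):
        x = lo + (i + 0.5) * h
        phi1 = math.exp(-0.5 * (x - mu1) ** 2) / math.sqrt(2.0 * math.pi)
        phi2 = math.exp(-0.5 * (x - mu2) ** 2) / math.sqrt(2.0 * math.pi)
        mix = p * phi1 + (1.0 - p) * phi2
        # Partial derivatives of the mixture density with respect to mu1 and mu2.
        grads = (p * (x - mu1) * phi1, (1.0 - p) * (x - mu2) * phi2)
        for a in range(2):
            for b in range(2):
                info[a][b] += h * grads[a] * grads[b] / mix
    return info


base = fisher_information_means(0.0, 1.5)
shifted = fisher_information_means(100.0, 101.5)  # same difference, shifted location
print(base)
print(shifted)
```

Both calls use the same difference $\mu_{2}-\mu_{1}=1.5$, so the two matrices should agree up to floating-point error, whereas changing the difference changes the entries.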

Moreover, consider a two-component mixture model with all the parameters unknown

[TABLE]

With some computations, it is straightforward to derive the Fisher information matrix for this model, partly shown in Table 3, where each element is multiplied by a term which does not depend on $\tau$ .

Therefore, the Fisher information matrix considered as a function of $\tau$ is a block matrix. From well-known results in linear algebra, if we consider a block matrix

$$M=\begin{pmatrix}A&B\\ C&D\end{pmatrix}$$

then its determinant is given by $\det(M)=\det(A-BD^{-1}C)\det(D)$ . In the case of a two-component mixture model where the total number of component parameters (i.e. not counting the weights) is $d=4$ , $\det(D)\propto\tau^{-4}$ , while $\det(A-BD^{-1}C)\propto 1$ (always interpreted as functions of $\tau$ only). Since the Jeffreys prior is the square root of the determinant of the Fisher information matrix, the Jeffreys prior for a two-component Gaussian mixture model is proportional to $\tau^{-2}$ . If we generalize to the case of a Gaussian mixture model with $k$ components, the total number of component parameters is $d=2k$ and the Jeffreys prior for a $k$ -component Gaussian mixture model is proportional to $\tau^{-k}$ .
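The block-determinant identity used in this step can be verified on a small numerical example. The matrix below is an arbitrary symmetric matrix chosen for illustration, not the Fisher information matrix of Table 3, and the helper functions are hypothetical names introduced for this sketch.

```python
import math


def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def inverse_2x2(m):
    d = det(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


# Arbitrary 4x4 symmetric example, split into 2x2 blocks A, B, C, D.
M = [[4.0, 1.0, 0.5, 0.2],
     [1.0, 3.0, 0.3, 0.1],
     [0.5, 0.3, 2.0, 0.4],
     [0.2, 0.1, 0.4, 1.5]]
A = [row[:2] for row in M[:2]]
B = [row[2:] for row in M[:2]]
C = [row[:2] for row in M[2:]]
D = [row[2:] for row in M[2:]]

# Schur complement A - B D^{-1} C (requires D to be invertible).
BDinvC = matmul(matmul(B, inverse_2x2(D)), C)
schur = [[A[i][j] - BDinvC[i][j] for j in range(2)] for i in range(2)]

lhs = det(M)
rhs = det(schur) * det(D)
identity_holds = math.isclose(lhs, rhs, rel_tol=1e-9)
print(lhs, rhs, identity_holds)
```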

When considering the general case of components from a location-scale family, this feature of improperness of the Jeffreys prior distribution is still valid, because, once reference location-scale parameters are chosen, the mixture model may be rewritten as

[TABLE]

Then the second derivatives of the logarithm of model (14) behave as the ones we have derived for the Gaussian case, i.e. they will depend on the differences between each location parameter and the reference one, but not on the reference location itself. Then the Jeffreys prior will be constant with respect to the global location parameter and powered in the global scale parameter.

∎

Appendix C: Jeffreys prior for $\delta=\mu_{2}-\mu_{1}$

*The Jeffreys prior of $\delta$ conditional on $\mu$ when only the location parameters are unknown is improper.*

Proof.

When considering the reparametrization by Mengersen and Robert (1996), the Jeffreys prior for $\delta$ for a fixed $\mu$ has the form:

[TABLE]

and the following result may be demonstrated. The improperness of the conditional Jeffreys prior on $\delta$ depends (up to a constant) on the double integral

[TABLE]

The order of integration may be exchanged, so that

[TABLE]

Define $f(x)=(1-p)e^{-\frac{x^{2}}{2}}=\frac{1}{d}$ . Then

[TABLE]

Since the behavior of $\left[d^{2}p\sigma\exp\{-\frac{\sigma^{2}(x+\frac{\delta}{\sigma\tau})^{2}}{2}\}+d\right]$ depends on $\exp\{-\delta^{2}\}$ as $\delta$ goes to $\infty$ , we have that

[TABLE]

because the integrand function is positive. Then

[TABLE]

Therefore the conditional Jeffreys prior on $\delta$ is improper.

∎

Figure 13 compares the behaviour of the prior and the resulting posterior distribution for the difference between the means of a two-component Gaussian mixture model: the prior distribution is symmetric and it has different behaviours depending on the value of the other parameters, but it always stabilizes for large enough values; the posterior distribution appears to always concentrate around the true value.

Appendix D: Proof of Lemma 3.3

When $k=2$ , the posterior distribution derived from the Jeffreys prior when only the location parameters of model

[TABLE]

are unknown is proper.

Proof.

The conditional Jeffreys prior for the means of a Gaussian mixture model follows the behavior of the product of the diagonal elements of the Fisher information matrix:

[TABLE]

where $\delta=\mu_{2}-\mu_{1}$ .

The posterior distribution is then defined as

[TABLE]

The likelihood may be rewritten (without loss of generality, by considering $\sigma_{1}=\sigma_{2}=1$ , since they are known) as

[TABLE]

Then, for $|\mu_{1}|\rightarrow\infty$ , $L(\theta)$ tends to the term

[TABLE]

that is constant in $\mu_{1}$ . Therefore we can study the behavior of the posterior distribution for this part of the likelihood to assess its properness.

This explains why we want the following integral to converge:

[TABLE]

which is equal to (by the change of variable $\mu_{2}-\mu_{1}=\delta$ )

[TABLE]

In Appendix C of the Supplementary Material it is possible to see that the prior distribution only depends on the difference between the means $\delta$ :

[TABLE]

where $\pi^{J}(\delta)$ is defined as

[TABLE]

As $\delta\rightarrow\pm\infty$ this quantity is constant with respect to $\delta$ . Therefore the integral (16) is convergent for $n\geq 2$ .

∎

Appendix E: Proof of Lemma 3.4

*When $k>2$ , the posterior distribution derived from the Jeffreys prior when only the location parameters of model (14) are unknown is improper.*

Proof.

In the case of $k>2$ components, the Jeffreys prior for the location parameters is still constant with respect to a reference mean (for example, $\mu_{1}$ ). Therefore it depends only on the difference parameters $(\delta_{2}=\mu_{2}-\mu_{1},\delta_{3}=\mu_{3}-\mu_{1},\cdots,\delta_{k}=\mu_{k}-\mu_{1})$ .

The Jeffreys prior depends on the product on the diagonal

[TABLE]

If we consider the case as in Lemma 3.3, where only the part of the likelihood depending on e.g. $\mu_{2}$ may be considered, the convergence of the following integral has to be studied:

[TABLE]

In this case, however, while the integral with respect to $\delta_{2}$ may converge, the integrals with respect to $\delta_{j}$ with $j\neq 2$ will diverge, since the prior tends to be constant for each $\delta_{j}$ as $|\delta_{j}|\rightarrow\infty$ . ∎

Appendix F: Proof of Theorem 3.1

The posterior distribution of the parameters of a mixture model with location-scale components derived from the Jeffreys prior when all parameters of model (14) are unknown is improper.

Proof.

Consider a mixture model with components coming from the location-scale family. The proof will consider Gaussian components, however it may be generalized to any location-scale distribution.

Consider the elements on the diagonal of the Fisher information matrix; again, since the Fisher information matrix is positive definite, the determinant is bounded by the product of the terms in the diagonal.
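The bound invoked here is Hadamard's inequality for positive semi-definite matrices. Written out for a $d\times d$ Fisher information matrix $I(\theta)$ with diagonal entries $I_{ii}(\theta)$ (this notation is introduced here for illustration and is not taken from the original text), it reads:

```latex
% Hadamard's inequality for a positive semi-definite d x d matrix I(\theta),
% and the corresponding bound on the square root of its determinant.
\det I(\theta) \;\le\; \prod_{i=1}^{d} I_{ii}(\theta)
\qquad\Longrightarrow\qquad
\sqrt{\det I(\theta)} \;\le\; \prod_{i=1}^{d} \sqrt{I_{ii}(\theta)} .
```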

Consider a reparametrization into $\tau=\sigma_{1}$ and $\tau\sigma=\sigma_{2}$ . Then it is straightforward to see that the integral of this part of the prior distribution will depend on a term $(\tau)^{-(d+1)}(\sigma)^{-d}$ , as seen in the proof of Lemma 3.2. The likelihood, on the other hand, is given by

[TABLE]

When composing the prior with the part of the likelihood which only depends on the first component, this part does not provide information about the parameters $\sigma$ and the integral will diverge.

In particular, the integral of the first part of the posterior distribution relative to the part of the likelihood dependent on the first component only and on the product of the diagonal terms of the Fisher information matrix for the prior when considering a two-component mixture model is

[TABLE]

When considering the integrals relative to the Jeffreys prior, they do not represent an issue for convergence with respect to the scale parameters, because exponential terms going to $0$ as the scale parameters tend to $0$ are present. However, when considering the part outside the previous integrals, a factor $\sigma^{-2}$ which causes divergence is present. Then this particular part of the posterior distribution does not integrate.

When considering the case of $k$ components, the integral inversely depends on $\sigma_{1},\sigma_{2},\cdots,\sigma_{k-1}$ , which implies that the posterior is always improper.

∎

Appendix G: Improperness of the posterior distribution deriving from the multivariate Jeffreys prior

Since the posterior distribution which follows from the use of the multivariate Jeffreys prior on the complete set of parameters is improper, we expect to see non-convergent behaviors in the MCMC simulations, in particular for small sample sizes. For small sample sizes, the chains tend to get stuck when very small values of standard deviations are accepted. Figure 14 shows the results for different sample sizes and different scenarios (in particular, the situations when the means are close or well separated from one another are considered) for a mixture model with two and three Gaussian components: sometimes the chains do not converge and tend towards very extreme values of means, sometimes the chains get stuck to very small values of standard deviations.

The improperness of the posterior distribution is not only due to the scale parameters: we may use a reparametrization of the problem as in Equation (8) and use a proper prior on the parameter $\sigma$ , for example, by following Robert and Mengersen (1999)

[TABLE]

and the Jeffreys prior for all the other parameters $(p,\mu,\delta,\tau)$ conditionally on $\sigma$ , and still face the same issue. Actually, using a proper prior on $\sigma$ does not avoid convergence trouble, as demonstrated by Figure 15, which shows that, even if the chains with respect to the standard deviations are not stuck around $0$ when using a proper prior for $\sigma$ in the reparametrization proposed by Robert and Mengersen (1999), the chains with respect to the location parameters demonstrate a divergent behavior.

These problems are overcome by the hierarchical prior proposed in Section 4: a simulation study (not shown) along the lines of the one just presented for the posterior distribution deriving from the multivariate Jeffreys prior confirms that the chains obtained via MCMC for 50 replications of the experiments always have a convergent behavior, in agreement with the properness of the posterior stated in Theorem 4.1.

Appendix H: The properness of the hierarchical representation of Theorem 4.1

*The posterior distribution derived from the hierarchical representation of the Gaussian mixture model associated with (9), (10) and (11) is proper.*

Proof.

Consider the composition of the three levels of the hierarchical model described in equations (9), (10) and (11):

[TABLE]

where $L(\cdot;\mathbf{x})$ is the likelihood of the mixture model.

Once again, we can initialize the proof by considering only the first term in the sum composing the likelihood function for the mixture model. Then the product in the expression above may be split into four terms corresponding to the different terms in the scale parameters’ prior. For instance, the first term is

[TABLE]

and the second one

[TABLE]

The integrals with respect to $\mu_{1}$, $\mu_{2}$ and $\mu_{0}$ converge, since the data carry information about $\mu_{0}$ through $\mu_{1}$. The integral with respect to $\sigma_{1}$ converges as well because, as $\sigma_{1}\rightarrow 0$, the exponential function goes to $0$ faster than $\frac{1}{\sigma_{1}^{n}}$ goes to $\infty$ (integrals where $\sigma_{1}>\zeta_{0}$ are not considered here because this reasoning easily extends to those cases). The integrals with respect to $\sigma_{2}$ converge because they provide a factor proportional to $\zeta_{0}$ and $1/\zeta_{0}$ respectively, which simplifies with the normalizing constant of the reference distribution (the uniform in the first case and the Pareto in the second one). Finally, the term $1/\zeta_{0}^{4}$ resulting from the previous operations has its counterpart in the integrals relative to the location priors. Therefore, the integral with respect to $\zeta_{0}$ converges.

The part of the posterior distribution relative to the weights is not an issue, since the weights belong to the corresponding simplex. ∎
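The behavior near $\sigma_{1}=0$ invoked in the proof can be checked on a simplified integrand. This is only a sketch: the exponent $m>1$ and the constant $S>0$ are generic placeholders rather than the exact quantities of the proof, and the integral is taken over the whole half-line instead of $(0,\zeta_{0})$, which can only make it larger. With the change of variable $t=S/(2\sigma^{2})$,

$$
\int_{0}^{\infty}\frac{1}{\sigma^{m}}\exp\left\{-\frac{S}{2\sigma^{2}}\right\}\mathrm{d}\sigma
=\frac{1}{2}\left(\frac{2}{S}\right)^{\frac{m-1}{2}}\int_{0}^{\infty}t^{\frac{m-3}{2}}e^{-t}\,\mathrm{d}t
=\frac{1}{2}\left(\frac{2}{S}\right)^{\frac{m-1}{2}}\Gamma\left(\frac{m-1}{2}\right)<\infty ,
$$

so the exponential term dominates the polynomial divergence of $1/\sigma^{m}$ at the origin as long as $S>0$.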

Appendix I: Effect of the sample size in the conservativeness of the Jeffreys prior

This Appendix shows the estimation of the density (19) when a higher number of components is assumed, together with a Jeffreys prior for the weights of the mixture, for sample sizes $50$, $100$, $500$ and $1{,}000$, when the true model is

[TABLE]

Figures 16-19 show the $M=20$ resulting estimated densities against (19). As the number of components increases, the estimated densities become less and less smooth; nevertheless, this feature is mitigated as the sample size increases.

Appendix J: Implementation Features

The computational cost of deriving the Jeffreys prior for a given set of parameter values is $\mathrm{O}(d^{2})$, where $d$ is the total number of (independent) parameters.
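As a worked count of this cost (assuming the prior is evaluated as the square root of the determinant of the $d\times d$ Fisher information matrix $I(\theta)$, whose symmetry leaves $d(d+1)/2$ distinct entries):

$$
\pi^{J}(\theta)\propto\sqrt{\det I(\theta)},\qquad \#\{\text{distinct entries of } I(\theta)\}=\frac{d(d+1)}{2}.
$$

For a $k$-component Gaussian mixture with all parameters unknown, $d=3k-1$ ($k-1$ free weights, $k$ means and $k$ standard deviations), so that $k=3$ gives $d=8$ and $36$ one-dimensional integrals for each evaluation of the prior.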

Each element of the Fisher information matrix is an integral of the form

[TABLE]

which has to be approximated. We have applied both numerical integration and Monte Carlo integration; simulations show that, in general, numerical integration via Gauss-Kronrod quadrature produces more stable results. Nevertheless, when one or more standard deviations or weights are too small, either the approximation becomes very dependent on the bounds used for numerical integration (usually chosen to omit a negligible part of the density) or numerical integration may not be applicable at all. In these cases, Monte Carlo integration appears to be more stable, with a stability that depends on the Monte Carlo sample size.

Figure 20 shows the value of the Jeffreys prior obtained via Monte Carlo integration of the elements of the Fisher information matrix for an increasing number of Monte Carlo simulations, both in a case where the Jeffreys prior diverges (where the standard deviations are small) and in a case where it takes low values. This value is then compared with the one obtained via numerical integration. The Monte Carlo sample size may be chosen either as the point where the graph stabilizes or as the number of simulations needed to reach a fixed level of variability.

Since the approximation problem is one-dimensional, another numerical solution could be based on Riemann sums. Figure 21 compares the approximation to the Jeffreys prior obtained via Monte Carlo integration with the one obtained via Riemann sums: the Riemann sums clearly lead to more stable results than Monte Carlo integration. Moreover, Riemann sums can be applied in more situations than Gauss-Kronrod quadrature, in particular when the standard deviations are very small (of order $10^{-2}$). Nevertheless, when the standard deviations are smaller than this, one has to pay attention to the features of the function to integrate. Indeed, the mixture density tends to concentrate around the modes, with regions of density close to $0$ between them. Since the elements of the Fisher information matrix are, in general, ratios between the components' densities and the mixture density, an indeterminate form of type $\frac{0}{0}$ is obtained in those regions; Figure 22 represents the behavior of one of these elements when $\sigma_{i}\rightarrow 0$ for $i=1,\ldots,k$.

Thus, we have decided to use Riemann sums (with $550$ points) to approximate the Jeffreys prior when the standard deviations are sufficiently large, and Monte Carlo integration (with samples of size $1500$) when they are too small. In the latter case, the variability of the results seems to decrease as $\sigma_{i}$ approaches $0$, as shown in Figure 23.

We have chosen Monte Carlo samples of size $1500$ because both the value of the approximation and its standard deviation stabilize around this size.
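The two approximations retained above can be illustrated on the simplest case, a two-component Gaussian mixture in which only the weight $p$ is unknown. The sketch below is not the authors' code: it assumes the score-product form of the Fisher information, $I(p)=\int (f_{1}-f_{2})^{2}/g\,\mathrm{d}x$ with $g=pf_{1}+(1-p)f_{2}$, and the function names, the integration range of eight standard deviations and the seed are illustrative choices. Only the $550$ grid points and the $1500$ simulations come from the text.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrand(x, p, mu1, s1, mu2, s2):
    """(f1 - f2)^2 / g, the integrand of the Fisher information of p."""
    f1 = normal_pdf(x, mu1, s1)
    f2 = normal_pdf(x, mu2, s2)
    g = p * f1 + (1.0 - p) * f2
    # Where g underflows to 0 the ratio is a 0/0 form; its limit is 0 because
    # (f1 - f2)^2 / g <= max(f1, f2) / min(p, 1 - p).
    return 0.0 if g == 0.0 else (f1 - f2) ** 2 / g


def fisher_riemann(p, mu1, s1, mu2, s2, n_points=550, width=8.0):
    """Midpoint Riemann sum over a range covering both components."""
    lo = min(mu1 - width * s1, mu2 - width * s2)
    hi = max(mu1 + width * s1, mu2 + width * s2)
    h = (hi - lo) / n_points
    return h * sum(integrand(lo + (i + 0.5) * h, p, mu1, s1, mu2, s2)
                   for i in range(n_points))


def fisher_monte_carlo(p, mu1, s1, mu2, s2, n_sim=1500, seed=0):
    """Average of ((f1 - f2) / g)^2 over draws from the mixture g."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        # Draw the component first, then the observation.
        if rng.random() < p:
            x = rng.gauss(mu1, s1)
        else:
            x = rng.gauss(mu2, s2)
        f1 = normal_pdf(x, mu1, s1)
        f2 = normal_pdf(x, mu2, s2)
        total += ((f1 - f2) / (p * f1 + (1.0 - p) * f2)) ** 2
    return total / n_sim


if __name__ == "__main__":
    params = (0.3, -1.0, 1.0, 2.0, 0.5)  # p, mu1, sigma1, mu2, sigma2
    # The Jeffreys prior on p is proportional to the square root of I(p).
    print(math.sqrt(fisher_riemann(*params)))
    print(math.sqrt(fisher_monte_carlo(*params)))
```

With a fixed grid, the Riemann sum can miss the peaks entirely once the standard deviations become much smaller than the grid step, which is the regime where the text switches to Monte Carlo integration.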

An adaptive MCMC algorithm has been used to tune the variability of the kernel densities used to propose the moves. During the burn-in, the variability of the kernel distributions has been reduced or increased depending on the acceptance rate, in such a way that the acceptance rate stays between $20\%$ and $40\%$. The transition kernels used have been truncated normals for the weights, normals for the means and log-normals for the standard deviations (all centered on the values accepted in the previous iteration).
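A minimal sketch of such a burn-in adaptation is given below. It is not the authors' implementation: the target is a stand-in (a single Gaussian sample with a $1/\sigma$ prior instead of the mixture posterior), and the batch length of 50 iterations, the multiplicative factors $0.9$ and $1.1$ and all names are assumptions made for illustration. Only the $20\%$ to $40\%$ acceptance band and the normal and log-normal proposals come from the text; the truncated normal proposal for the weights is omitted.

```python
import math
import random


def log_post(mu, sigma, data):
    """Log of a stand-in target: Gaussian likelihood times a 1/sigma prior."""
    ss = sum((x - mu) ** 2 for x in data)
    return -(len(data) + 1) * math.log(sigma) - 0.5 * ss / sigma ** 2


def adaptive_mh(data, n_burn=2000, n_keep=3000, batch=50, seed=1):
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    lp = log_post(mu, sigma, data)
    scale = {"mu": 1.0, "sigma": 1.0}    # proposal standard deviations
    batch_acc = {"mu": 0, "sigma": 0}    # acceptances in the current batch
    kept_acc = {"mu": 0, "sigma": 0}     # acceptances after burn-in
    draws = []
    for it in range(1, n_burn + n_keep + 1):
        burning = it <= n_burn
        counter = batch_acc if burning else kept_acc

        # Mean: normal proposal centred on the current value (symmetric).
        prop = rng.gauss(mu, scale["mu"])
        lp_prop = log_post(prop, sigma, data)
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            mu, lp = prop, lp_prop
            counter["mu"] += 1

        # Standard deviation: log-normal proposal centred on log(sigma).
        # It is asymmetric, so the ratio q(sigma | prop) / q(prop | sigma),
        # equal to prop / sigma, enters the acceptance probability.
        prop = sigma * math.exp(rng.gauss(0.0, scale["sigma"]))
        lp_prop = log_post(mu, prop, data)
        log_ratio = lp_prop - lp + math.log(prop) - math.log(sigma)
        if math.log(1.0 - rng.random()) < log_ratio:
            sigma, lp = prop, lp_prop
            counter["sigma"] += 1

        if burning and it % batch == 0:
            # Shrink the proposal if too few moves are accepted and widen it
            # if too many are; the scales are frozen once burn-in ends.
            for name in scale:
                rate = batch_acc[name] / batch
                if rate < 0.20:
                    scale[name] *= 0.9
                elif rate > 0.40:
                    scale[name] *= 1.1
                batch_acc[name] = 0
        if not burning:
            draws.append((mu, sigma))

    rates = {name: kept_acc[name] / n_keep for name in kept_acc}
    return draws, scale, rates
```

Freezing the scales after the burn-in keeps the retained part of the chain a standard Metropolis-Hastings sampler, so the adaptation does not affect its stationary distribution.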

Bibliography

Bechtel, Y.C., Bonaiti-Pellie, C., Poisson, N., Magnette, J. and Bechtel, P.R. (1993). A population and family study of N-acetyltransferase using caffeine urinary metabolites. Clinical Pharmacology & Therapeutics, 54(2) 134–141.
Berger, J., Bernardo, J. and Sun, D. (2009). Natural induction: An objective Bayesian approach. Rev. Acad. Sci. Madrid, A 103 125–159. (With discussion).
Bernardo, J. and Giròn, F. (1988). A Bayesian analysis of simple mixture problems. In Bayesian Statistics 3 (J. Bernardo, M. De Groot, D. Lindley and A. Smith, eds.). Oxford University Press, Oxford, 67–78.
Casella, G., Mengersen, K., Robert, C. and Titterington, D. (2002). Perfect slice samplers for mixtures of distributions. J. Royal Statist. Society Series B, 64(4) 777–790.
Celeux, G., Hurn, M. and Robert, C. (2000). Computational and inferential difficulties with mixture posterior distributions. J. American Statist. Assoc., 95(3) 957–979.
Crawford, S.L., De Groot, M.H., Kadane, J.B. and Small, M.J. (1992). Modeling Lake-Chemistry Distributions: Approximate Bayesian Methods for Estimating a Finite-Mixture Model. Technometrics, 34(4) 441–453.
Diebolt, J. and Robert, C. (1994). Estimation of finite mixture distributions by Bayesian sampling. J. Royal Statist. Society Series B, 56 363–375.
Escobar, M.D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430) 577–588.
