Bayesian variance estimation in the Gaussian sequence model with partial information on the means
Gianluca Finocchio, Johannes Schmidt-Hieber

TL;DR
This paper investigates Bayesian variance estimation in the Gaussian sequence model with partial mean information, revealing that certain priors lead to inconsistent posteriors while hierarchical Gaussian mixture priors achieve consistency and improved estimators.
Contribution
It demonstrates the inconsistency of the marginal posterior under i.i.d. priors and introduces hierarchical Gaussian mixture priors that ensure consistency and better estimation performance.
Findings
Posterior is inconsistent for i.i.d. priors on means.
Hierarchical Gaussian mixture priors achieve posterior consistency.
Bayesian estimators outperform the MLE in simulations.
| Estim. | |||||
|---|---|---|---|---|---|
| 10 | 0.414 ( 8.7e-03) | 0.411 ( 8.6e-03) | 0.386 ( 8.2e-03) | 0.399 ( 8.4e-03) | |
| 100 | 0.040 ( 5.9e-04) | 0.040 ( 5.9e-04) | 0.039 ( 5.7e-04) | 0.041 ( 6.4e-04) | |
| 1000 | 0.004 ( 5.7e-05) | 0.004 ( 5.6e-05) | 0.004 ( 5.8e-05) | 0.004 ( 5.8e-05) | |
| 10 | 0.235 ( 3.1e-03) | 0.268 ( 4.2e-03) | 0.336 ( 6.2e-03) | 0.399 ( 8.4e-03) | |
| 100 | 0.028 ( 3.8e-04) | 0.031 ( 4.2e-04) | 0.037 ( 5.2e-04) | 0.041 ( 6.4e-04) | |
| 1000 | 0.003 ( 4.3e-05) | 0.003 ( 4.4e-05) | 0.004 ( 5.4e-05) | 0.004 ( 5.8e-05) | |
| 10 | 0.337 ( 3.3e-03) | 0.330 ( 4.6e-03) | 0.359 ( 6.9e-03) | 0.398 ( 8.3e-03) | |
| 100 | 0.036 ( 4.3e-04) | 0.032 ( 4.2e-04) | 0.034 ( 4.7e-04) | 0.041 ( 6.3e-04) | |
| 1000 | 0.003 ( 4.9e-05) | 0.003 ( 4.5e-05) | 0.003 ( 4.9e-05) | 0.004 ( 5.8e-05) | |
| 10 | 0.167 ( 2.1e-03) | 0.182 ( 3.8e-03) | 0.232 ( 5.9e-03) | 0.283 ( 7.0e-03) | |
| 100 | 0.040 ( 4.5e-04) | 0.034 ( 4.3e-04) | 0.034 ( 4.7e-04) | 0.041 ( 6.2e-04) | |
| 1000 | 0.004 ( 5.1e-05) | 0.003 ( 4.6e-05) | 0.003 ( 4.9e-05) | 0.004 ( 5.8e-05) |
Gianluca Finocchio and Johannes Schmidt-Hieber (University of Twente)
Abstract
Consider the Gaussian sequence model under the additional assumption that a fixed fraction of the means is known. We study the problem of variance estimation from a frequentist Bayesian perspective. The maximum likelihood estimator (MLE) for $\sigma^2$ is biased and inconsistent. This raises the question of whether the posterior is able to correct the MLE in this case. By developing a new proof strategy that uses refined properties of the posterior distribution, we find that the marginal posterior is inconsistent for any i.i.d. prior on the mean parameters. In particular, no assumption on the decay of the prior needs to be imposed. Surprisingly, we also find that consistency can be retained for a hierarchical prior based on Gaussian mixtures. In this case we also establish a limiting shape result and determine the limit distribution. In contrast to the classical Bernstein-von Mises theorem, the limit is non-Gaussian. We show that the Bayesian analysis leads to new statistical estimators outperforming the correctly calibrated MLE in a numerical simulation study.
Keywords: frequentist Bayes, maximum likelihood, semiparametric inference, Gaussian sequence model, Bernstein-von Mises theorems.
arXiv:1904.04525
1 Introduction
For given suppose we observe independent and normally distributed random variables
[TABLE]
The parameters in the model are and The goal is to estimate the variance while treating the mean vector as nuisance. For we recover the Gaussian sequence model. For this can be viewed as the Gaussian sequence model with additional knowledge that the means of the first observations are known (in which case we can subtract them from the data).
One can think of model (1.1) as a simple prototype of a combined dataset. Using for instance different measurement devices, one often faces merged datasets collected from multiple sources. The different sources might not be of the same quality concerning the underlying parameter, see [24] for an example. An alternative viewpoint is to interpret model (1.1) as a sparse sequence model with known support. Since a -fraction of the data is perturbed, we are in the dense regime. Knowledge of the support is then crucial as otherwise there is no consistent estimator for
If is even and then (1.1) is equivalent to the Neyman-Scott model [25] up to a reparametrization. Model (1.1) is in this case equivalent to observing and for Since and are independent, this is thus equivalent to observing independent random variables with Estimation of in the latter model is known as Neyman-Scott problem.
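The reduction to the Neyman-Scott problem can be checked numerically. The sketch below is illustrative only: it assumes pairs of observations sharing an unknown mean, uses the normalized differences as the mean-free part of the data, and draws the nuisance means from a uniform distribution purely for convenience. The function name and all parameter values are hypothetical.

```python
import math
import random


def neyman_scott_estimates(num_pairs=20000, sigma2=1.0, seed=0):
    """Simulate pairs X_i1, X_i2 ~ N(mu_i, sigma2) and return two variance estimates."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    sum_sq_diff = 0.0
    for _ in range(num_pairs):
        mu = rng.uniform(-5.0, 5.0)  # arbitrary nuisance mean (illustrative choice)
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        y = (x1 - x2) / math.sqrt(2.0)  # Y_i ~ N(0, sigma2), free of mu_i
        sum_sq_diff += y * y
    from_differences = sum_sq_diff / num_pairs  # uses only the mean-free part
    mle = sum_sq_diff / (2 * num_pairs)  # Neyman-Scott MLE, exactly half of the above
    return from_differences, mle
```

The within-pair residual sum of squares equals the squared normalized difference, so the MLE over all 2 * num_pairs observations is exactly half of the estimate based on the differences alone.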
Although can be estimated with parametric rate based on the first observations, a striking feature of the model is that the MLE for is inconsistent. In fact the MLE converges to therefore underestimating the true variance by the factor The reason is that the likelihood of the observations with non-zero mean significantly affects the total likelihood viewed as a function in
We study what happens when a Bayesian approach is implemented for the estimation of the variance and whether the posterior distribution can correct for the bias of the MLE. The Bayesian method can be viewed as a weighted likelihood method: instead of taking the parameter with the largest likelihood, the posterior puts mass on parameter sets with large likelihood. Because of this, the posterior can in some cases correct the flaws of the MLE. Examples are irregular models, see [15, 11, 26].
In the first part of the paper, we prove that whenever the nuisances are independently generated from a proper distribution, the posterior does not contract around the true variance. This shows that, for a large class of natural priors, the Bayesian method is unable to correct the MLE. In frequentist Bayes, several lower bound techniques have been developed in order to describe when Bayesian methods do not work, [4, 8, 9, 29, 10, 19]. These results can be used for instance to show that a certain decay of the prior is necessary to ensure posterior contraction. Our lower bounds are of a different flavor and do not require a condition on the tail behavior.
Since for the non-zero means no additional structure is assumed, there is no way to say something about one mean from knowledge of all the other means. Therefore, one might be tempted to think that a correlated prior on the means cannot perform better than an i.i.d. prior and consequently must lead to an inconsistent posterior as well. Surprisingly, this is not true and we construct in the second part of the article a Gaussian mixture prior for which the posterior contracts with the parametric rate around the true variance. For this prior we derive the limit distribution in the Bernstein-von Mises sense. In contrast with the classical Bernstein-von Mises theorem, the posterior limit is non-Gaussian in the case of small means. In this case the posterior also incorporates information about the second part of the sample into the estimator and we show in a simulation study that the maximum a posteriori estimate based on the limit distribution outperforms the $\sqrt{n}$-consistent estimator that only uses the observations with zero mean.
Estimation of the variance in model (1.1) can also be interpreted as a semiparametric problem. The results in this article therefore contribute to the recent efforts to understand frequentist Bayes in semiparametric models. Semiparametric Bernstein-von Mises theorems are derived under various conditions in [27, 5, 3, 7]. For specific priors, it has been observed that there can be a large bias in the posterior limit, see [6, 7, 26]. In all the cases studied so far, it is unclear whether the bias is due to the specific choice of prior or whether this is a fundamental limitation of the Bayesian method. To the best of our knowledge, our results show for the first time, that the posterior can be inconsistent for all natural priors.
Related to model (1.1), [14] studies Bayes for variance estimation of the errors in the nonparametric regression model. It is shown that if the posterior contracts around the true regression function with rate the marginal posterior for the variance contracts with parametric rate around the true error variance and a Bernstein-von Mises result holds.
The article is organized as follows. In Section 2, we discuss aspects of the problem related to the likelihood and the posterior distribution. A crucial identity for the log-posterior is derived in Section 3. This leads then to the general negative result in Section 4. The Gaussian mixture prior with parametric posterior contraction is constructed in Section 5. This section also contains the limiting shape result and a numerical simulation study. All proofs are deferred to the appendix.
Notation: For a vector , we write and for the averages of the squares (not to be confused with the squared averages). We write and . The probability and expectation induced by model (1.1) are denoted by and
2 Likelihood and posterior
The MLE. For the subsequent analysis, it is convenient to split the data vector in the part with zero means and the observations with non-zero means such that The likelihood function of the model is
[TABLE]
Maximizing over yields the MLE
[TABLE]
If only based on the subsample the MLE for would be and this converges to with the parametric rate Hence converges to The MLE for is therefore inconsistent and misses the true parameter by a factor It is clear that there is very little extractable information about the parameter in A frequentist estimator can simply discard and only use the subsample The MLE also does this but leads to an incorrect scaling of the estimator.
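The scaling defect of the MLE is easy to reproduce in simulation. The following minimal sketch assumes a fraction alpha of observations with known zero mean and arbitrary means for the rest; the names `alpha`, `n1` and the uniform choice of true means are illustrative assumptions, not notation taken from the text.

```python
import math
import random


def variance_estimates(n=20000, alpha=0.5, sigma2=2.0, seed=1):
    """Return (full-sample MLE, MLE based on the zero-mean subsample) for sigma2."""
    rng = random.Random(seed)
    n1 = int(alpha * n)  # observations with known (zero) mean
    sigma = math.sqrt(sigma2)
    y = [rng.gauss(0.0, sigma) for _ in range(n1)]
    mu = [rng.uniform(-3.0, 3.0) for _ in range(n - n1)]  # arbitrary true means
    z = [rng.gauss(m, sigma) for m in mu]
    mu_hat = z  # the likelihood is maximized over the means at mu_hat = Z
    rss_z = sum((zj - mj) ** 2 for zj, mj in zip(z, mu_hat))  # exactly zero
    sum_y2 = sum(v * v for v in y)
    mle_full = (sum_y2 + rss_z) / n  # divides by n although only n1 terms contribute
    mle_subsample = sum_y2 / n1
    return mle_full, mle_subsample
```

Because the fitted means absorb the second subsample completely, the full-sample MLE equals the subsample estimate multiplied by n1 / n, which is the incorrect scaling discussed above.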
The incorrect scaling factor of the MLE can be explained in different ways. One interpretation is that the MLE can be written as
[TABLE]
with the MLE based on the subsample and the MLE based on the subsample The fact that the overall MLE just forms a linear combination of the MLEs for the subsamples shows again that too much weight is given to
Another explanation for the incorrect scaling of the MLE is to observe that in (2.1) the likelihood based on the second subsample is if If we took the likelihood only over the first part of the sample we would obtain the optimal estimator but since the likelihood over the full sample is the product of the likelihood functions for and an additional factor occurs in the overall likelihood which leads to the incorrect scaling. More generally, we conjecture that likelihood methods do not perform well for combined datasets where one part of the data is informative about a parameter and the other part is affected by nuisance parameters.
Adjusted profile likelihood. For the profile likelihood, we first compute the maximum likelihood estimator of the nuisance parameter for fixed denoted by, say and then maximize
[TABLE]
Obviously for any and the profile likelihood estimator coincides with the MLE for in the Neyman-Scott problem. If the parameter of interest and the nuisance parameters are orthogonal with respect to the Fisher information, that is,
[TABLE]
the adjusted profile likelihood estimator [12, 23, 13] is the maximizer of
[TABLE]
for the matrix valued function
[TABLE]
and the determinant. It is easy to check that (2.3) holds for model (1.1). Since $-\partial^{2}/(\partial\mu_{j}\partial\mu_{\ell}) \log L(\sigma^{2},\mu \mid Y,Z)=\sigma^{-2}\mathbf{1}(j=\ell)$, the adjusted profile likelihood estimator for coincides with the MLE for the subsample
[TABLE]
In particular, the adjusted profile likelihood results in an unbiased $\sqrt{n}$-consistent estimator for $\sigma^2$.
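The computation behind this claim can be written out as follows. This is a sketch under two assumptions: the adjustment takes the standard form in which the profile likelihood is multiplied by the determinant of the nuisance information block raised to the power $-1/2$, and $n_1$, $n_2 = n - n_1$ denote the sizes of the two subsamples (notation introduced here for illustration).

```latex
% Notation assumed here: n_1 observations Y_i with zero mean, n_2 = n - n_1 observations Z_j.
\begin{align*}
L(\sigma^2,\mu \mid Y,Z)
  &\propto \sigma^{-n}\exp\Big(-\frac{1}{2\sigma^2}\Big[\sum_{i=1}^{n_1} Y_i^2
     +\sum_{j=1}^{n_2}(Z_j-\mu_j)^2\Big]\Big),
  \qquad \widehat\mu_{\sigma^2}=Z, \\
L_p(\sigma^2)=L(\sigma^2,\widehat\mu_{\sigma^2}\mid Y,Z)
  &\propto \sigma^{-n}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n_1} Y_i^2\Big), \\
\det\big(\sigma^{-2}I_{n_2}\big)^{-1/2} &= \sigma^{n_2}, \\
L_p(\sigma^2)\,\det\big(\sigma^{-2}I_{n_2}\big)^{-1/2}
  &\propto \sigma^{-n_1}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n_1} Y_i^2\Big),
  \qquad \operatorname*{arg\,max}_{\sigma^2} = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_i^2 .
\end{align*}
```

The determinant factor replaces the exponent $-n$ by $-n_1$, which is exactly the correction the unadjusted MLE is missing.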
The posterior distribution. From a Bayesian perspective it is quite natural to draw and the mean vector from independent distributions. Due to the orthogonality with respect to the Fisher information (2.3), we expect no strong interactions of and the mean parameters in the likelihood that could be taken care of by a dependent prior. Suppose that and that the prior for has Lebesgue density The marginal posterior distribution is then given by Bayes formula
[TABLE]
with
[TABLE]
In [28] it has been argued that by using multivariate Laplace approximation,
[TABLE]
with the adjusted profile likelihood in (2.4). This suggests that the posterior distribution should be centered around the adjusted profile likelihood estimator therefore correcting the MLE.
Associated sequence model with random means. For the Gaussian sequence model with partial information (1.1) equipped with the product prior define the associated sequence model with random means, where we observe independent random variables
[TABLE]
with and known. In this model, the nuisance parameters are replaced by additional randomness. The only parameter in this model is and the model is therefore parametric.
Remark 2.1.
The likelihood function of model (2.8) is Model (1.1) and model (2.8) lead therefore to the same formula for the posterior distribution of in terms of
Bayes with improper uniform prior. If the prior on the mean vector in the Bayes formula is chosen as the Lebesgue measure, the formula for the posterior simplifies to
[TABLE]
This is the same posterior we would get if we discarded the subsample It follows from the parametric Bernstein-von Mises theorem that if is positive and continuous in a neighbourhood of the posterior contracts around the true variance Notice that in the case of uniform prior, the Laplace approximation in (2.7) is exact and does not involve any remainder terms. Obviously the Lebesgue measure is not a probability measure and the prior is improper. This raises then the question whether there are also proper priors for which the marginal posterior is consistent on the whole parameter space. We will address this problem in the next sections.
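To see why the second subsample drops out, note that integrating each Gaussian factor of the second subsample over its mean with respect to Lebesgue measure gives one, whatever the variance. The sketch below evaluates the resulting unnormalized log-posterior on a grid. It assumes, purely for illustration, a flat prior density on the variance; the function names and grid bounds are hypothetical.

```python
import math


def log_posterior_uniform_prior(sigma2, n1, sum_y2):
    """Unnormalized log marginal posterior of sigma2 with Lebesgue prior on the means
    and a flat prior density on sigma2 (the flat density is an assumption)."""
    return -0.5 * n1 * math.log(sigma2) - sum_y2 / (2.0 * sigma2)


def grid_mode(n1, sum_y2, lo=0.01, hi=5.0, steps=500):
    """Grid point maximizing the unnormalized log-posterior."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda s: log_posterior_uniform_prior(s, n1, sum_y2))
```

With, say, 50 zero-mean observations whose squares sum to 60, the grid mode sits at about 60 / 50 = 1.2, the value of the MLE computed from the first subsample alone.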
3 On the derivative of the log-posterior
We first derive a differential equation for the posterior. Denote by the posterior distribution of for the sample that is,
[TABLE]
In particular, we set
[TABLE]
The quantity measures the spread of around the vector Recall moreover the definition of in (2.6).
Proposition 3.1.
The marginal posterior satisfies
[TABLE]
By Remark 2.1, the right hand side is a closed-form expression of the score function for in the random means model (2.8). If the MLE in (2.8) does not lie on the boundary, the score function vanishes at the MLE. From the Bernstein-von Mises phenomenon it is conceivable that the posterior will concentrate around this MLE. For the MLE to be close to the truth the score function evaluated at must be Since this leads to the condition
[TABLE]
In the next section, we derive a very general negative result. The main part of the argument is to show that the previous equality does not hold in a neighborhood of see (A.12).
4 Posterior inconsistency for product priors
In this section we study posterior contraction under the following condition.
Prior. The prior on is independent of the prior on Under the prior, each component of the mean vector is drawn independently from a distribution on The prior on has a positive and continuously differentiable Lebesgue density on
So far, denoted the prior on the mean vector. By a slight abuse of language we denote the prior on the individual components also by The assumptions on the prior are mild enough to account for proper priors with heavy tails and possibly no moments.
The i.i.d. prior is the natural choice, if we believe that there is no structure in the non-zero means. From (2.8) it follows that the corresponding sequence model with random means is
[TABLE]
with For and unknown this model has been studied in [21]. It is shown that the MLE for and the MLE for the distribution function of the means are consistent. Since the random means model leads to the same posterior distribution as explained in Remark 2.1, this suggests that the posterior might concentrate around the truth.
We now provide a second heuristic that leads to a different conclusion indicating that it makes a huge difference whether the distribution of the means is known or unknown. In the framework of (4.1), is known. If then and so we have This means that model (4.1) carries a lot of information about in the sense that can be estimated with parametric rate from the subsample only. Since the posterior only sees model (4.1) it is therefore natural to give a lot of weight to the subsample as well, which, from a frequentist perspective, is wrong.
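This heuristic can be illustrated with a moment estimator. The sketch below reads the argument as follows (an interpretation, since the displayed formulas are not reproduced here): if the prior on each mean has a known finite second moment, the random means model implies that the expected square of each observation in the second subsample equals the variance plus that second moment. The standard normal prior and the fixed true means equal to 2 are illustrative assumptions.

```python
import math
import random


def moment_estimate(z, prior_second_moment):
    """Moment estimator of sigma2 suggested by the random means model:
    E[Z^2] = sigma2 + E[mu^2] under the prior."""
    return sum(v * v for v in z) / len(z) - prior_second_moment


def demo(n2=10000, sigma2=1.0, seed=2):
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    prior_m2 = 1.0  # second moment of the assumed N(0, 1) prior on each mean
    # Case 1: the means really are draws from the prior.
    z_prior = [rng.gauss(rng.gauss(0.0, 1.0), sigma) for _ in range(n2)]
    # Case 2: fixed true means equal to 2, so the prior does not describe them.
    z_fixed = [rng.gauss(2.0, sigma) for _ in range(n2)]
    return moment_estimate(z_prior, prior_m2), moment_estimate(z_fixed, prior_m2)
```

In the first case the estimate is close to the true variance 1; in the second it is close to 1 + 4 - 1 = 4. Trusting the second subsample is therefore only justified when the means behave like draws from the prior.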
This heuristic does not say anything about heavy-tailed priors with But even in this case, we will show that the posterior is inconsistent. The first result states that in a neighborhood of the posterior is increasing extremely fast with high probability.
Proposition 4.1.
Given and the prior above, then, for all sufficiently large there exists a mean vector such that
[TABLE]
The proof of Proposition 4.1 constructs a lower bound on that is independent of and moreover guarantees that has sufficiently small mass outside It therefore depends on the tail behavior of the prior mean distribution The mean vector is subsequently chosen with all means being equal to an expression only depending on Thus the means in are uniformly bounded and independent of as well.
Suppose that almost all posterior mass is close to By the previous proposition, the posterior is increasing at least up to Hence, there must be even more mass around This is a contradiction and shows that the posterior does not concentrate around The proof of the next theorem is based on this argument. For this result, the means in the vector can again be chosen to be uniformly bounded.
Theorem 4.2.
Given and the prior above, then, for all sufficiently large there exists a mean vector such that
[TABLE]
Consequently, the posterior is inconsistent and assigns all its mass outside of a neighbourhood of the true variance.
The posterior is therefore inferior if compared to the frequentist variance estimator which achieves the parametric rate in the sense that
[TABLE]
It is remarkable that no conditions on the tail behavior of the prior distribution are required for Theorem 4.2. Recall that for the improper uniform prior the posterior contracts around This shows that for distributions with heavy tailed densities, very sharp bounds are required.
To the best of our knowledge there are no negative results in the nonparametric Bayes literature that hold for such a large class of priors. The proof strategy to establish Proposition 4.1 is based on a highly non-standard shrinkage argument that will be sketched here. By expanding the square term in (3.2) we can lower bound (3.3) by
[TABLE]
where For close to , we have
[TABLE]
For an improper uniform prior, one can check that , making the lower bound negative and useless. For a proper prior, there is a shrinkage phenomenon in the sense that for any there are parameters such that , with high probability. If this is the case then
[TABLE]
which yields the conclusion by choosing small enough.
In Proposition 4.1 we showed that the posterior overshoots the true variance whenever the true means are large enough. By analyzing the Gaussian case in the next section, we see that for small means the posterior will in fact underestimate and that only for a small range of mean vectors, one can hope that the posterior will be able to concentrate around the true variance.
5 Gaussian mixture priors
5.1 Gaussian priors
To illustrate our approach, we first consider an i.i.d. Gaussian prior on the mean vector
[TABLE]
From Theorem 4.2 we already know that the posterior will be inconsistent in this case. Nevertheless, the Gaussian assumption yields more explicit formulas and this allows us to build a hierarchical prior resulting in good posterior contraction properties. By Remark 2.1, the marginal likelihood is the same as in the sequence model with random means (4.1). The marginal posterior is therefore
[TABLE]
which can also be written as the product of two inverse Gamma densities. In view of the Bernstein-von Mises phenomenon, the posterior concentrates around the MLE for parametric problems. Similarly, we can argue here that the posterior will be concentrated around the value maximizing the likelihood part of the posterior (5.1). By differentiation, we find and rearranging yields
[TABLE]
This can be rewritten as
[TABLE]
where we set
[TABLE]
and suppress the dependence of the term on and Since is fixed, this shows that for we need
[TABLE]
In other words, to force the maximum to be close to the variance of the prior has to match the empirical variance of the nuisance parameter. We can also deduce from (5.2) that if and is fixed, then also More precisely, we even have that implies and implies This shows that, depending on the size of compared to the posterior can either overestimate or underestimate the true variance.
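The over- and underestimation can be observed numerically by maximizing the likelihood part of the posterior on a grid. In the sketch below the symbol `theta2` for the prior variance, the equal subsample sizes, and the choice of all true means equal to 1 are assumptions made for illustration; the marginal model used is that the first subsample has variance sigma2 and the second has variance sigma2 + theta2.

```python
import math
import random


def log_lik(s, theta2, n1, sum_y2, n2, sum_z2):
    """Log-likelihood of sigma2 = s when Y_i ~ N(0, s) and Z_j ~ N(0, s + theta2)."""
    v = s + theta2
    return (-0.5 * n1 * math.log(s) - sum_y2 / (2.0 * s)
            - 0.5 * n2 * math.log(v) - sum_z2 / (2.0 * v))


def maximizer(theta2, n1, sum_y2, n2, sum_z2):
    """Grid maximizer of the log-likelihood over sigma2 in (0, 3)."""
    grid = [0.01 + 0.001 * k for k in range(3000)]
    return max(grid, key=lambda s: log_lik(s, theta2, n1, sum_y2, n2, sum_z2))


def demo(n1=20000, n2=20000, sigma2=1.0, mean=1.0, seed=3):
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    sum_y2 = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n1))
    sum_z2 = sum(rng.gauss(mean, sigma) ** 2 for _ in range(n2))
    # theta2 below, equal to, and above the empirical second moment mean**2 = 1
    return {t: maximizer(t, n1, sum_y2, n2, sum_z2) for t in (0.25, 1.0, 4.0)}
```

With the true variance equal to 1, a prior variance that is too small pushes the maximizer above 1, a prior variance that is too large pulls it below 1, and only the matched value leaves it near the truth.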
If is allowed to vary with , we can make the right hand side in (5.2) arbitrarily small by letting tend to infinity. As is the variance of the prior, the behavior resembles then that of the uniform improper prior, which, as we already know, leads to posterior consistency. If we think of a prior as a prior belief on the parameters, then the prior should not change depending on the amount of available data and, in particular, it is unnatural that the prior becomes more vague if the sample size increases. In the next section we show that there are sample size independent mixture priors leading to parametric posterior contraction rates.
5.2 Mixture priors
Section 4 explains the posterior inconsistency for an i.i.d. prior on the nuisance. It seems counterintuitive that introducing dependence in the prior on the nuisance parameter could help avoid posterior inconsistency for the variance. Surprisingly, it can. In this section, we first provide some intuition why mixture priors can resolve the issues of i.i.d. priors. Afterwards, we discuss and analyze a specific prior construction.
Analyzing Gaussian priors above, (5.3) suggests that for any nuisance parameter vector there exists an i.i.d. prior which seems to work. This i.i.d. prior does, however, depend on the unknown and can therefore not be chosen without knowledge of the data. Intuitively, if the posterior had the chance to see all possible i.i.d. priors on instead of just one, it is conceivable that it would automatically select one that is adapted to the unknown nuisance parameter and consequently leads to posterior consistency for the parameter of interest. De Finetti’s theorem [18] states that an exchangeable prior over the infinite sequence can be written as a mixture over i.i.d. priors in the sense that
[TABLE]
with a probability measure on the set of probability densities on . Assuming interchangeability of the integrals, the posterior (2.5) then becomes
[TABLE]
where denotes the probability density function of Let be the i.i.d. prior maximizing the interior integral. Suppose that this is a unique maximum and that the outer integral is determined by the behavior of the integrand in a suitable neighborhood of This means that
[TABLE]
The right hand side is the posterior density of for i.i.d. prior on the components.
Although this argument is only a sketch, it suggests that something might be gained by mixing over i.i.d. priors instead of just choosing one. Maximizing the marginalized likelihood in (5.1) over yields
[TABLE]
if the r.h.s. is non-negative. For this choice of (5.1) becomes The posterior therefore coincides with the posterior density based on the first part of the sample only, which we know has good posterior contraction properties.
Prior. In a first step generate , with a positive Lebesgue density on Given each non-zero mean is drawn independently from a centered normal distribution with variance that is,
Another heuristic about the posterior properties for this prior can again be derived by making the link to the associated sequence model with random means (2.8). For the prior considered here, the random means model has the form
[TABLE]
with If were a second parameter and not generated from the variance would not be identifiable if only the ’s are observed. In model (5.5) we know the density but this is not enough to consistently reconstruct from the subsample By Remark 2.1, this model leads to the same posterior for The posterior should therefore realize that there is little extractable information about in and discard these observations. We will see in the limiting shape result below that this is roughly what happens.
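The claim that the posterior effectively discards the second subsample can be probed by numerical integration over the hyperparameter. The sketch below is a rough illustration, not the construction analyzed in the text: it assumes a flat prior on the variance, an Exp(1) hyperprior on the prior variance `theta2`, all true means equal to 1.5, and a simple Riemann sum over a finite grid.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_mixture_posterior(s, n1, sum_y2, n2, sum_z2, theta_grid):
    """Unnormalized log marginal posterior of sigma2 = s under the hierarchical prior,
    with a flat prior on s and an Exp(1) hyperprior on theta2 (both are assumptions)."""
    log_lik_y = -0.5 * n1 * math.log(s) - sum_y2 / (2.0 * s)
    terms = []
    for t in theta_grid:
        v = s + t  # marginal variance of each Z_j given theta2 = t
        terms.append(-0.5 * n2 * math.log(v) - sum_z2 / (2.0 * v) - t)
    return log_lik_y + log_sum_exp(terms)  # Riemann sum up to a constant factor


def demo(n1=2000, n2=2000, sigma2=1.0, mean=1.5, seed=4):
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    sum_y2 = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n1))
    sum_z2 = sum(rng.gauss(mean, sigma) ** 2 for _ in range(n2))
    theta_grid = [0.01 * k for k in range(1, 801)]
    s_grid = [0.5 + 0.005 * k for k in range(201)]
    mode = max(s_grid, key=lambda s: log_mixture_posterior(
        s, n1, sum_y2, n2, sum_z2, theta_grid))
    return mode, sum_y2 / n1
```

In this setup the posterior mode lands within grid resolution of the estimate based on the first subsample only: the integral over `theta2` absorbs the second subsample almost regardless of the value of the variance.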
We denote by and the log-likelihoods of the sub-samples and coming from model (5.5) with replacing that is
[TABLE]
The log-likelihoods appearing in (5.6) can be written in terms of inverse-gamma distributions. We denote by the inverse-gamma distribution with shape and scale The corresponding p.d.f. is
[TABLE]
where is the Gamma function. Rewriting the posterior, we have that
Lemma 5.1.
Under the Gaussian mixture prior, the marginal posterior density has the form
[TABLE]
with and The distribution has mode and variance whereas the distribution has mode and variance
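For reference, the standard density, mode and variance of an inverse-gamma distribution with a given shape and scale can be coded as below. The helper `y_factor_parameters` is a hypothetical illustration of how a likelihood factor of the form sigma2^(-n1/2) exp(-S / (2 sigma2)) maps to inverse-gamma parameters; the exact parameters appearing in Lemma 5.1 are not reproduced here.

```python
import math


def inv_gamma_logpdf(x, shape, scale):
    """Log-density of the inverse-gamma distribution with the given shape and scale."""
    return (shape * math.log(scale) - math.lgamma(shape)
            - (shape + 1.0) * math.log(x) - scale / x)


def inv_gamma_mode(shape, scale):
    return scale / (shape + 1.0)


def inv_gamma_variance(shape, scale):
    if shape <= 2.0:
        raise ValueError("the variance is finite only for shape > 2")
    return scale ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))


def y_factor_parameters(n1, sum_y2):
    """Shape and scale such that sigma2 -> sigma2^(-n1/2) * exp(-sum_y2 / (2 sigma2))
    is proportional to an inverse-gamma density (requires n1 > 2)."""
    return n1 / 2.0 - 1.0, sum_y2 / 2.0
```

Under this mapping the mode equals sum_y2 / n1, the MLE computed from the zero-mean subsample.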
Starting from Lemma 5.1, we can develop a heuristic argument on how to recover the shape of the limit posterior distribution. We interpret the posterior with density (5.8) as the marginalized version, over the set of the distribution whose density is given by
[TABLE]
and refer to as the joint posterior on The first step is double localization. Thanks to the exponential tails of the inverse Gamma distribution, the joint posterior asymptotically concentrates on the set with a -ball centered at and a -ball around for a sequence This also implies that the joint posterior (5.9) is arbitrarily close, in total variation distance, to the truncated posterior distribution with density In particular, this means that the hyperparameter concentrates on a neighborhood of the maximal value derived in (5.4).
Arguing as in the classical proof of the Bernstein-von Mises theorem, we can then show that the truncated posterior distribution will asymptotically not depend on the prior and prove that the posterior given by (5.8) behaves asymptotically like
[TABLE]
Using essentially Laplace approximation, we show that the log-likelihoods and in (5.6) can always be uniformly approximated by a second-order Taylor expansion around their maxima and and thus the localized posterior converges in total variation distance to a distribution with density
[TABLE]
whose factors are a truncated Gaussian density with mode and variance and the integral of a truncated Gaussian density with mode and variance By undoing the localization argument, we can show that the restriction to the sets and can be removed from (5.11) and the posterior given by (5.8) converges in total variation distance to the posterior limit distribution
[TABLE]
with the c.d.f. of the standard normal distribution. Recall that This suggests that the term involving in the posterior limit distribution should asymptotically disappear if The limit of the posterior should then be the truncated Gaussian
[TABLE]
with mode and variance
The next result is a formal statement of the arguments mentioned above. To pass to (5.13) involves an additional -factor in the signal strength of Denote by the total variation distance and recall that the expectation is taken with respect to model (1.1).
Theorem 5.2.
Let and be the distributions corresponding to the densities (5.12) and (5.13), respectively. If the prior densities are positive and uniformly continuous, then, for any compact sets and
[TABLE]
Moreover, if then
[TABLE]
As a corollary of the proof, posterior contraction around the true variance with contraction rate can be established. In the case of large means this is an immediate consequence of the posterior limit and the parametric Bernstein-von Mises theorem. For small means it is less obvious because of the non-standard limit of the posterior.
Corollary 5.3.
There exists a constant such that
[TABLE]
The posterior limit distribution is closely related to the class of skew normal distributions, see [1, 2]. We now derive an alternative characterization of the limit distribution. From the argumentation above, the p.d.f.
[TABLE]
can be viewed as the joint posterior limit of In particular, the posterior limit distribution is the marginal distribution with respect to As this is clear from the context, we do not write explicitly that the following distributions are conditional on that is, are assumed to be fixed.
Lemma 5.4.
Let
[TABLE]
be independent. The distribution with p.d.f. (5.14) coincides with the distribution of
[TABLE]
In particular, the posterior limit distribution coincides with the distribution of
[TABLE]
If the standard deviations of are small compared to the means, the posterior limit distribution essentially compares the means and This behavior is very reasonable because if is small, and the subsample becomes informative about
The posterior limit depends on unknown quantities. A frequentist estimator mimicking the posterior would be to estimate from the MLE for zero means in the case that the means are small. To detect whether small means are present, we can check whether which leads then to the estimator
[TABLE]
5.3 Finite sample analysis
We compare the estimators and to the maximum and the mean of the limit density for sample sizes As discussed above, we expect to see some differences for small means. We study the performances for and the vector with all entries equal to for the values Since does not depend on the means, the estimator performs equally well in all setups. Table 1 reports the average of the squared errors and the corresponding standard errors based on repetitions. The rescaled MLE performs worse than any of the other estimators for small signals. Among the other estimators there is no clear ’winner’. For the risk of all estimators is nearly the same. For larger values of our simulation experiments did not show any changes compared to and the results are therefore omitted from the table.
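The risk of the estimator that uses only the zero-mean observations can be computed exactly, which gives a sanity check on the magnitudes in Table 1. The sketch below assumes a true variance of 1 and half of the observations having known mean; these settings are guesses made for illustration, since the simulation parameters are not legible in this copy.

```python
import random


def exact_mse(n, alpha, sigma2):
    """MSE of the unbiased estimator (1/n1) * sum(Y_i^2) with n1 = alpha * n."""
    return 2.0 * sigma2 ** 2 / (alpha * n)


def monte_carlo_mse(n, alpha, sigma2, reps=20000, seed=5):
    """Monte Carlo average of the squared error of the same estimator."""
    rng = random.Random(seed)
    n1 = int(alpha * n)
    sigma = sigma2 ** 0.5
    total = 0.0
    for _ in range(reps):
        estimate = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n1)) / n1
        total += (estimate - sigma2) ** 2
    return total / reps
```

Under these assumed settings the exact risk is 0.4, 0.04 and 0.004 for sample sizes 10, 100 and 1000, the same order of magnitude as the entries of Table 1 for the estimator that does not depend on the means.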
There has been a long-standing debate whether Bayesian methods perform well if interpreted as frequentist methods. Results like the complete class theorem and the Bernstein-von Mises theorem have been foundational in this regard, see [22, 16]. Our theory highlights another instance where Bayes leads to new estimators with good finite sample properties. The analysis moreover shows that the construction of a prior resulting in a posterior with good frequentist properties can be highly non-intuitive.
Appendix A Proofs
A.1 Proofs for Section 3
Proof of Proposition 3.1.
By direct computation,
[TABLE]
Since
[TABLE]
we recover (3.3). ∎
A.2 Proofs for Section 4
Proof of Proposition 4.1.
It is enough to show that the following statements hold for sufficiently large sample size Let Since is a distribution function for We work on where is chosen such that
[TABLE]
and denotes the fraction of known zero means in the model. Notice that
[TABLE]
Let
[TABLE]
We choose the non-zero means to be
[TABLE]
The interval is compact and the prior is continuous and positive on Since we also assumed that is continuous, we find that
[TABLE]
for all sufficiently large With (3.3) and (A.2),
[TABLE]
Using (3.1) and (3.2), we expand
[TABLE]
Since the integrands in the latter display are positive for we can set and bound
[TABLE]
As a next step in the proof, we show
[TABLE]
To prove this inequality, we distinguish the cases and decomposing
[TABLE]
with
[TABLE]
For the term of (A.8), observe that If and therefore,
[TABLE]
Next, we bound the term in (A.8). In the sequel, we frequently make use of the fact that The idea is to split the domain of integration into sets and The contribution of the first part can be bounded by More work is needed for the second part. By expanding the square in the exponent, the -terms in the numerator and denominator cancel against each other, as they do not depend on and we have
[TABLE]
We now treat numerator and denominator separately. For the numerator, the function attains its maximum at and is bounded by This means that where the last step follows from (A.2). Together with (A.2), we obtain
[TABLE]
using to bound the exponent in the integral. To derive a lower bound of the denominator, we replace the integral over by an integral over On this interval, and since We obtain
[TABLE]
Combining this with the upper bound for the numerator yields, with (A.1), (A.3) and the definition of the function
[TABLE]
Together with (A.9) and (A.7),
[TABLE]
With we finally obtain (A.6).
In a final step of the proof, we derive, on an event with large probability, a deterministic lower bound for the right hand side in (A.6). Let be independent random variables. Rewriting Chebyshev’s inequality yields We apply this with Recall that Therefore, and For the variance, and Since by assumption Chebyshev’s inequality yields then when for the set
[TABLE]
On we have using (A.3), (A.6) and
[TABLE]
The assertion follows with (A.5). ∎
Proof of Theorem 4.2.
Proposition 4.1 shows that
[TABLE]
has -probability tending to one. This means that for with we must have Exponentiating this inequality for yields
[TABLE]
and this completes the proof since is equivalent to ∎
A.3 Proofs for Section 5
Proof of Lemma 5.1.
We can write the posterior as
[TABLE]
By using (5.6) and (5.7) we obtain (5.8). ∎
We now prepare for the proof of the limiting shape result. From (5.8), the density (5.9) of the joint posterior is
[TABLE]
With
[TABLE]
define
[TABLE]
It is shown below that the posterior concentrates on and The posterior can consequently be approximated by the distribution defined through its density (5.10). On the localized set we are able to replace the log-likelihoods by a quadratic expansion. This then allows us to approximate the posterior by which is defined as the distribution with density (5.11). We now state the single steps formally and provide the proofs.
Proposition A.1.
If the prior densities are positive and uniformly continuous, then there exists a sequence of sets such that for any compact sets
- (i)
- (ii)
With as defined in (A.15), we have for
[TABLE]
- (iii)
For
[TABLE]
- (iv)
For
[TABLE]
- (v)
For
[TABLE]
- (vi)
For and ,
[TABLE]
Proof of Proposition A.1.
Recall the definition of in (A.14) and set
[TABLE]
Let Define the event
[TABLE]
Since this implies in particular that on
Proof of (i): We simplify the notation by introducing the events
[TABLE]
so that . Thus We show that both and tend to zero uniformly over compact sets of parameters. By Chebyshev’s inequality,
[TABLE]
Since
[TABLE]
we find
[TABLE]
with Notice that is a finite constant since and are compact sets. Because the previous probability tends to zero as increases. We now bound . Rewriting we obtain
[TABLE]
and again by Chebyshev’s inequality
[TABLE]
which again tends to zero for uniformly over
Proof of (ii): We work on the event defined in (A.17) deriving deterministic lower and upper bounds for the denominator and numerator in the Bayes formula. We start with
[TABLE]
and show that on the event this quantity tends to [math] when tends to infinity. The first part of the proof provides a lower bound for the denominator. For that, we restrict and where the last inclusion follows since by definition of the event in (A.17), . The inner integral in the denominator of (A.18) can be lower bounded by
[TABLE]
Thanks to the definition of in (A.17) and we have so that We then set
[TABLE]
Since are compact sets and is continuous and positive, we must have Differentiating (5.6) gives so the function is decreasing on for any As a direct consequence of (5.6), we obtain
[TABLE]
Consequently, for any
[TABLE]
where the last inequality follows since on and for The right hand side does not depend on anymore. To lower bound the first integral in the denominator of (A.18) we apply a similar argument. By (5.6), This means that the function is increasing on and (5.6) yields
[TABLE]
On and therefore Set
[TABLE]
so that because is a compact set and is continuous and positive. We bound
[TABLE]
using that on and for The product of the lower bounds obtained in (A.20) and (A.21) is then a lower bound for the denominator of (A.18).
In the next step we upper bound the numerator of (A.18). Firstly, observe that and
[TABLE]
Secondly, since is increasing on and decreasing on
[TABLE]
The numerator of (A.18) is upper bounded by the product of the bounds obtained in (A.22) and (A.23). Together with the bounds on the denominator in (A.20) and (A.21), and we derive, on the event the following bound for (A.18):
[TABLE]
The convergence to zero follows since by definition of the constant in (A.16), and because of
Along similar lines, we show now that, on the event as tends to infinity. Since and tends to zero by (A.24), it is sufficient to establish convergence of
[TABLE]
to zero. We can argue similarly as for the upper bound above using that By following the same steps as for (A.22) and (A.23) and using that is increasing on and decreasing on the numerator in (5.9) integrated over the set is upper bounded by
[TABLE]
Together with the lower bounds for the denominator in (A.20) and (A.21), we upper bound (A.25), on the event by
[TABLE]
By definition (see (A.16)), the constant satisfies Since this implies that the right hand side of (A.26) is bounded above by as Together with (A.24), this completes the proof for part (ii).
Proof of (iii): It is well-known that for probability measures defined on the same measurable space
[TABLE]
see Lemma E.1 in [26]. With and the distribution with density
[TABLE]
we have that
[TABLE]
By bounding the -distance between the densities, we now show that and are close in total variation using the following lemma.
Lemma A.2 (Lemma E.3 in [26]).
If exists and for some then also
[TABLE]
As is the Radon-Nikodym derivative up to a multiplicative factor, we can choose
[TABLE]
Then,
[TABLE]
Using the argument above, it remains to prove that for By the definition of and due to
[TABLE]
Recall that is a compact set. Since is positive and uniformly continuous,
[TABLE]
Similarly, we have on the event
[TABLE]
Since for all the average of the squares lies in the convex hull of and
[TABLE]
For real numbers We therefore obtain with (A.28) and (A.30), for This completes the proof of
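The localization step used in this proof rests on an elementary fact: replacing a probability measure by its restriction to an event, renormalized, changes it in total variation by exactly the mass outside that event (with the convention that total variation is the supremum over events). The sketch below checks this on a small discrete distribution. The probabilities and the event are hypothetical, and the exact constants in the displayed inequality from [26] may follow a different normalization.

```python
def total_variation(p, q):
    """Total variation sup_B |P(B) - Q(B)| for two finite distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def condition_on(p, event):
    """Conditional distribution P(. | A); `event` flags the atoms inside A."""
    mass = sum(pi for pi, inside in zip(p, event) if inside)
    return [pi / mass if inside else 0.0 for pi, inside in zip(p, event)]


# Hypothetical distribution on five atoms; the event A holds the first three.
p = [0.4, 0.3, 0.2, 0.07, 0.03]
event = [True, True, True, False, False]

localized = condition_on(p, event)
mass_outside = sum(pi for pi, inside in zip(p, event) if not inside)

# The two printed numbers agree up to floating-point rounding.
print(total_variation(p, localized), mass_outside)
```

This is why it suffices to show that the posterior mass outside the localizing sets vanishes.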
Proof of (iv): We use the same strategy as in the proof of part applying Lemma A.2 to
[TABLE]
which is a constant multiple of the likelihood ratio of and To verify the assumptions of Lemma A.2, we have to show that for Using again the identity and the fact that we find that it is enough to prove that on the event
[TABLE]
To verify (A.32), differentiating (5.6) gives
[TABLE]
and by a third-order Taylor expansion around the maximum
[TABLE]
for some between and We now control the smaller order terms uniformly over Observe that also With (A.29), and for all Moreover, since is compact, Together this shows that
[TABLE]
establishing (A.32). To prove (A.33) we argue similarly. Differentiating (5.6) gives
[TABLE]
and by a third-order Taylor expansion around the maximum ,
[TABLE]
for some between and If then, on both and are in With (A.29) and (A.31), we have and for sufficiently large Together with the reasoning for (A.32), this leads to
[TABLE]
being bounded by and thus converging to zero.
Proof of (v): Define as the distribution on with density (5.14), that is,
[TABLE]
and as the localization of on that is, the distribution with density
[TABLE]
Here are as defined in (A.15). The marginal distributions of and with respect to are and respectively. Applying (A.27) yields
[TABLE]
To prove it remains to show that for
[TABLE]
By Lemma 5.4, it is enough to prove that on
[TABLE]
for independent Recall that this and all the following statements in (v) should be understood conditionally on
To bound the terms, we heavily rely on the exponential bounds for tail probabilities of Gaussian variables given by Mill’s ratio [17]
[TABLE]
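For orientation, the classical two-sided form of this bound for a standard normal variable $Z$ and $t>0$ reads $\frac{t}{1+t^2}\varphi(t) \leq P(Z \geq t) \leq \frac{\varphi(t)}{t}$, where $\varphi$ is the standard normal density. The display above may state a variant of it. A minimal numerical check, with hypothetical function names, is:

```python
import math


def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_upper_tail(t):
    """P(Z >= t) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def mills_bounds(t):
    """Classical lower and upper bounds on P(Z >= t), valid for t > 0."""
    lower = t / (1.0 + t * t) * normal_pdf(t)
    upper = normal_pdf(t) / t
    return lower, upper


for t in (0.5, 1.0, 2.0, 4.0):
    lower, upper = mills_bounds(t)
    print(f"t={t}: {lower:.3e} <= {normal_upper_tail(t):.3e} <= {upper:.3e}")
```

Both bounds become tight as $t$ grows, which is what makes them useful for the exponentially small tail terms bounded in this part of the proof.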
In a first step we derive a lower bound on Using that on the definition of the symmetry properties of the distribution, on and Mill’s ratio, we find
[TABLE]
where in the last inequality we used that for .
We now derive an upper bound for Using the definition of and Mill’s ratio (A.37),
[TABLE]
Next, we obtain a similar bound for If we define the difference of two sets as then, On the event we have that implies that which is equivalent to saying that implies On by definition. Because of we obtain Together with the symmetry properties of the normal distribution, and Mill’s ratio (A.37), this yields
[TABLE]
To prove (A.36), we bound
[TABLE]
and
[TABLE]
Now (A.36) (and therefore (A.35)) follow from the inequalities (A.38), (A.39), (A.40) and the definition of This completes the proof of
Proof of (vi): Recall the definitions of the densities
[TABLE]
and let
[TABLE]
be their localized versions on . It is enough to show that, on
[TABLE]
For (A.41), we apply (A.27) and the fact that is the marginal distribution of finding
[TABLE]
In we proved that the right hand side converges to zero uniformly over For (A.42), we argue similarly, using that
[TABLE]
with Using (A.39), we see that the right hand side converges to zero, uniformly over
For (A.43), we apply Lemma A.2. On the likelihood ratio of and is given by
[TABLE]
On
[TABLE]
Uniformly over and the right hand side can be further upper bounded by for sufficiently large Thus,
[TABLE]
Since for
[TABLE]
This concludes the proof of (vi). ∎
Proof of Theorem 5.2.
We insert in the expectation. Since the total variation distance of probability measures is bounded, the result follows from Proposition A.1. ∎
Proof of Corollary 5.3.
Recall that the posterior is the marginal distribution of with respect to By Proposition A.1 (ii), we have that
[TABLE]
Using that on and we obtain
[TABLE]
for a constant that is chosen to be sufficiently large. The claim follows by splitting the expected posterior, inserting in the expectation and using Proposition A.1 (i). ∎
Proof of Lemma 5.4.
To prove the result, we derive an expression for the joint density of $(\xi,\eta-\xi)\mid(0\leq\xi\leq\eta)$. Observe that
[TABLE]
The right hand side is zero if Suppose now that Conditioning on the right hand side can be rewritten as
[TABLE]
Taking derivatives, the density of $(\xi,\eta-\xi)\mid(0\leq\xi\leq\eta)$ at point equals up to a multiplicative constant This completes the proof for the case
The case is similar and the proof is therefore omitted.
Since the posterior limit distribution is the marginal over the first component of the joint distribution in (5.14), it must coincide with the distribution of $\xi\mid(0\leq\xi\leq\eta)$. ∎
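The conditional law $\xi\mid(0\leq\xi\leq\eta)$ is easy to explore numerically. The sketch below assumes that $\xi$ and $\eta$ are independent Gaussians with hypothetical means and standard deviations; the actual parameters are those given in Lemma 5.4. It uses the fact that, for independent variables, the conditional density at $x\geq 0$ is proportional to the density of $\xi$ at $x$ times $P(\eta\geq x)$, and compares the resulting mean with a rejection-sampling estimate.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_sf(x, mean, sd):
    """P(N(mean, sd^2) >= x)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def conditional_mean_numeric(m_xi, s_xi, m_eta, s_eta, grid=20000):
    """Mean of xi | (0 <= xi <= eta) by midpoint integration of the
    unnormalized density f_xi(x) * P(eta >= x) over x >= 0."""
    upper = max(m_xi, 0.0) + 10.0 * s_xi  # the xi-density is negligible beyond
    h = upper / grid
    num = den = 0.0
    for k in range(grid):
        x = (k + 0.5) * h
        w = normal_pdf(x, m_xi, s_xi) * normal_sf(x, m_eta, s_eta)
        num += x * w
        den += w
    return num / den


def conditional_mean_rejection(m_xi, s_xi, m_eta, s_eta, draws, seed=0):
    """Same mean by rejection: keep xi whenever 0 <= xi <= eta."""
    rng = random.Random(seed)
    kept = []
    for _ in range(draws):
        xi = rng.gauss(m_xi, s_xi)
        eta = rng.gauss(m_eta, s_eta)
        if 0.0 <= xi <= eta:
            kept.append(xi)
    return sum(kept) / len(kept), len(kept)


if __name__ == "__main__":
    params = (1.0, 1.0, 2.0, 1.0)  # hypothetical (m_xi, s_xi, m_eta, s_eta)
    print(conditional_mean_numeric(*params))
    print(conditional_mean_rejection(*params, draws=100000))
```

When both standard deviations are small relative to the means, the event $\{0\leq\xi\leq\eta\}$ is decided almost entirely by the ordering of the two means, in line with the discussion of the limit distribution in Section 5.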
Acknowledgment
We are grateful to an anonymous referee and the AE for many helpful suggestions resulting in a major improvement of the article. The research has been supported by an NWO TOP grant.
References
- [1] Azzalini, A. (1985) A class of distributions which includes the normal ones. Scand. J. Statist., 12(2), 171–178. MR808153
- [2] Azzalini, A. and Dalla Valle, A. (1996) The multivariate skew-normal distribution. Biometrika, 83(4), 715–726. MR1440039
- [3] Bickel, P. J. and Kleijn, B. J. K. (2012) The semiparametric Bernstein–von Mises theorem. Ann. Statist., 40(1), 206–237. MR3013185
- [4] Castillo, I. (2008) Lower bounds for posterior rates with Gaussian process priors. Electron. J. Stat., 2, 1281–1299. MR2471287
- [5] Castillo, I. (2012) A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probab. Theory Related Fields, 152, 53–99. MR2875753
- [6] Castillo, I. (2012) Semiparametric Bernstein–von Mises theorem and bias, illustrated with Gaussian process priors. Sankhya A, 74(2), 194–221. MR3021557
- [7] Castillo, I. and Rousseau, J. (2015) A Bernstein–von Mises theorem for smooth functionals in semiparametric models. Ann. Statist., 43(6), 2353–2383. MR3405597
- [8] Castillo, I. and van der Vaart, A. (2012) Lower bounds for posterior rates with Gaussian process priors. Ann. Statist., 40(4), 2069–2101. MR3059077
