Bayesian variance estimation in the Gaussian sequence model with partial information on the means

Gianluca Finocchio; Johannes Schmidt-Hieber

arXiv:1904.04525 · math.ST · December 19, 2019

TL;DR

This paper investigates Bayesian variance estimation in the Gaussian sequence model with partial mean information, revealing that certain priors lead to inconsistent posteriors while hierarchical Gaussian mixture priors achieve consistency and improved estimators.

Contribution

It demonstrates the inconsistency of the marginal posterior under i.i.d. priors and introduces hierarchical Gaussian mixture priors that ensure consistency and better estimation performance.

Findings

1. The posterior is inconsistent for i.i.d. priors on the means.

2. Hierarchical Gaussian mixture priors achieve posterior consistency.

3. Bayesian estimators outperform the correctly calibrated MLE in simulations.

Abstract

Consider the Gaussian sequence model under the additional assumption that a fixed fraction of the means is known. We study the problem of variance estimation from a frequentist Bayesian perspective. The maximum likelihood estimator (MLE) for $σ^{2}$ is biased and inconsistent. This raises the question whether the posterior is able to correct the MLE in this case. By developing a new proving strategy that uses refined properties of the posterior distribution, we find that the marginal posterior is inconsistent for any i.i.d. prior on the mean parameters. In particular, no assumption on the decay of the prior needs to be imposed. Surprisingly, we also find that consistency can be retained for a hierarchical prior based on Gaussian mixtures. In this case we also establish a limiting shape result and determine the limit distribution. In contrast to the classical Bernstein-von Mises…

Tables

Table 1: Comparison of the estimators for $(\sigma_{0}^{2},\mu_{0})=(1,(t/n^{1/4},\ldots,t/n^{1/4}))$ and $t\in\{0,1,2,5\}.$

| Estimator | $n$ | $t=0$ | $t=1$ | $t=2$ | $t=5$ |
|---|---|---|---|---|---|
| $\hat{\sigma}_{Y}^{2}$ | 10 | 0.414 ($\pm$ 8.7e-03) | 0.411 ($\pm$ 8.6e-03) | 0.386 ($\pm$ 8.2e-03) | 0.399 ($\pm$ 8.4e-03) |
| $\hat{\sigma}_{Y}^{2}$ | 100 | 0.040 ($\pm$ 5.9e-04) | 0.040 ($\pm$ 5.9e-04) | 0.390 ($\pm$ 5.7e-04) | 0.041 ($\pm$ 6.4e-04) |
| $\hat{\sigma}_{Y}^{2}$ | 1000 | 0.004 ($\pm$ 5.7e-05) | 0.004 ($\pm$ 5.6e-05) | 0.004 ($\pm$ 5.8e-05) | 0.004 ($\pm$ 5.8e-05) |
| $\tilde{\sigma}^{2}$ | 10 | 0.235 ($\pm$ 3.1e-03) | 0.268 ($\pm$ 4.2e-03) | 0.336 ($\pm$ 6.2e-03) | 0.399 ($\pm$ 8.4e-03) |
| $\tilde{\sigma}^{2}$ | 100 | 0.028 ($\pm$ 3.8e-04) | 0.031 ($\pm$ 4.2e-04) | 0.037 ($\pm$ 5.2e-04) | 0.041 ($\pm$ 6.4e-05) |
| $\tilde{\sigma}^{2}$ | 1000 | 0.003 ($\pm$ 4.3e-05) | 0.003 ($\pm$ 4.4e-05) | 0.004 ($\pm$ 5.4e-05) | 0.004 ($\pm$ 5.8e-05) |
| $\hat{\sigma}_{\mathrm{map},\infty}^{2}$ | 10 | 0.337 ($\pm$ 3.3e-03) | 0.330 ($\pm$ 4.6e-03) | 0.359 ($\pm$ 6.9e-03) | 0.398 ($\pm$ 8.3e-03) |
| $\hat{\sigma}_{\mathrm{map},\infty}^{2}$ | 100 | 0.036 ($\pm$ 4.3e-04) | 0.032 ($\pm$ 4.2e-04) | 0.034 ($\pm$ 4.7e-04) | 0.041 ($\pm$ 6.3e-04) |
| $\hat{\sigma}_{\mathrm{map},\infty}^{2}$ | 1000 | 0.003 ($\pm$ 4.9e-05) | 0.003 ($\pm$ 4.5e-05) | 0.003 ($\pm$ 4.9e-05) | 0.004 ($\pm$ 5.8e-05) |
| $\hat{\sigma}_{\mathrm{mean},\infty}^{2}$ | 10 | 0.167 ($\pm$ 2.1e-03) | 0.182 ($\pm$ 3.8e-03) | 0.232 ($\pm$ 5.9e-03) | 0.283 ($\pm$ 7.0e-03) |
| $\hat{\sigma}_{\mathrm{mean},\infty}^{2}$ | 100 | 0.040 ($\pm$ 4.5e-04) | 0.034 ($\pm$ 4.3e-04) | 0.034 ($\pm$ 4.7e-04) | 0.041 ($\pm$ 6.2e-04) |
| $\hat{\sigma}_{\mathrm{mean},\infty}^{2}$ | 1000 | 0.004 ($\pm$ 5.1e-05) | 0.003 ($\pm$ 4.6e-05) | 0.003 ($\pm$ 4.9e-05) | 0.004 ($\pm$ 5.8e-05) |

Equations

$$X_{i}\sim\mathcal{N}\big(\mu_{i}^{0}\mathbf{1}(i>n\alpha),\sigma_{0}^{2}\big),\quad i=1,\ldots,n.$$

$$\begin{split}L\big(\sigma^{2},\mu\big|Y,Z\big)&=\underbrace{\frac{1}{(2\pi\sigma^{2})^{n_{1}/2}}e^{-\frac{\|Y\|^{2}}{2\sigma^{2}}}}_{L(\sigma^{2},\mu|Y)}\underbrace{\frac{1}{(2\pi\sigma^{2})^{n_{2}/2}}e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}}_{L(\sigma^{2},\mu|Z)}\\ &=\frac{1}{(2\pi\sigma^{2})^{n/2}}e^{-\frac{\|Y\|^{2}+\|Z-\mu\|^{2}}{2\sigma^{2}}}.\end{split}$$

$$\big(\widehat{\sigma}_{\operatorname{mle}}^{2},\widehat{\mu}_{\operatorname{mle}}\big)=\Big(\frac{\|Y\|^{2}}{n},Z\Big).$$

$$\widehat{\sigma}_{\operatorname{mle}}^{2}=\frac{n_{1}}{n}\widehat{\sigma}_{Y,\operatorname{mle}}^{2}+\frac{n_{2}}{n}\widehat{\sigma}_{Z,\operatorname{mle}}^{2},$$

$$\sigma^{2}\mapsto L\big(\sigma^{2},\widehat{\mu}_{\sigma^{2}}\big|Y,Z\big).$$

$$E\Big[\frac{\partial^{2}}{\partial\sigma^{2}\partial\mu_{j}}\log L\big(\sigma^{2},\mu\big|Y,Z\big)\Big]=0,\quad\text{for all }j$$

$$\sigma^{2}\mapsto\mathcal{L}(\sigma^{2}):=\det\big(M(\sigma^{2},\widehat{\mu}_{\sigma^{2}})\big)^{-1/2}L\big(\sigma^{2},\widehat{\mu}_{\sigma^{2}}\big|Y,Z\big)$$

$$M(\sigma^{2},\mu):=\Big(-\frac{\partial^{2}}{\partial\mu_{j}\partial\mu_{\ell}}\log L\big(\sigma^{2},\mu\big|Y,Z\big)\Big)_{j,\ell}$$

$$\widehat{\sigma}^{2}=\frac{\|Y\|^{2}}{n_{1}}.$$

$$\pi\big(\sigma^{2}\big|Y,Z\big)=\frac{L(\sigma^{2}|Y,Z)\pi(\sigma^{2})}{\int_{\mathbb{R}_{+}}L(\sigma^{2}|Y,Z)\pi(\sigma^{2})\,d\sigma^{2}},$$

$$L(\sigma^{2}|Y,Z)=\sigma^{-n}e^{-\frac{\|Y\|^{2}}{2\sigma^{2}}}\Big(\int_{\mathbb{R}^{n}}e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}d\nu(\mu)\Big).$$

$$L(\sigma^{2}|Y,Z)=\mathcal{L}(\sigma^{2})\nu\big(\widehat{\mu}_{\sigma^{2}}\big)\big(1+O_{P}(n^{-1})\big)=\mathcal{L}(\sigma^{2})\nu\big(Z\big)\big(1+O_{P}(n^{-1})\big),$$

$$Y_{i}\sim\mathcal{N}(0,\sigma_{0}^{2}),\ i=1,\dots,n_{1}\quad\text{and}\quad Z_{i}\mid\mu\sim\mathcal{N}(\mu_{i},\sigma_{0}^{2}),\ i=n_{1}+1,\dots,n,$$

$$\pi\big(\sigma^{2}\big|Y,Z\big)\propto\sigma^{-n_{1}}e^{-\frac{\|Y\|^{2}}{2\sigma^{2}}}\pi(\sigma^{2}).$$

$$d\Pi(\mu\mid Z,\sigma^{2})=\frac{e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}d\nu(\mu)}{\int_{\mathbb{R}^{n}}e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}d\nu(\mu)}.$$

$$V\big(\mu|(Z,\sigma^{2})\big):=\int_{\mathbb{R}^{n}}\|Z-\mu\|^{2}d\Pi(\mu|Z,\sigma^{2}).$$

$$\partial_{\sigma^{2}}\log\frac{\pi(\sigma^{2}\mid Y,Z)}{\pi(\sigma^{2})}=\partial_{\sigma^{2}}\log L(\sigma^{2}\mid Y,Z)=\frac{\|Y\|^{2}+V(\mu\mid(Z,\sigma^{2}))}{2\sigma^{4}}-\frac{n}{2\sigma^{2}}.$$

$$\frac{V(\mu\mid(Z,\sigma_{0}^{2}))}{n}=(1-\alpha)\sigma_{0}^{2}+o_{P}(1).$$

$$Y_{i}\sim\mathcal{N}(0,\sigma_{0}^{2}),\ i=1,\dots,n_{1}\quad\text{and}\quad Z_{i}\mid\mu_{i}\sim\mathcal{N}(\mu_{i},\sigma_{0}^{2}),\ i=n_{1}+1,\dots,n,$$

$$\lim_{n\to\infty}P_{0}^{n}\Big(\Big\{\partial_{\sigma^{2}}\log\pi(\sigma^{2}|Y,Z)\geq\sigma_{0}^{-2}n,\ \forall\sigma^{2}\in\Big[\frac{\sigma_{0}^{2}}{2},2\sigma_{0}^{2}\Big]\Big\}\Big)=1.$$

$$\lim_{n\to\infty}E_{0}^{n}\Big[\Pi\Big(\Big|\frac{\sigma^{2}}{\sigma_{0}^{2}}-1\Big|\leq\frac{1}{2}\Big|Y,Z\Big)\Big]=0.$$

$$\sup_{\sigma_{0}^{2}>0}E_{0}^{n}\Big[\Big|\frac{\overline{Y^{2}}}{\sigma_{0}^{2}}-1\Big|\Big]\lesssim n^{-1/2}.$$

$$\partial_{\sigma^{2}}\log\pi(\sigma^{2}\mid Y,Z)$$

$$\mu_{i}\sim\mathcal{N}(0,\theta^{2}),\quad\text{independently}.$$

$$\pi\big(\sigma^{2}\big|Y,Z\big)\propto\sigma^{-n_{1}}(\theta^{2}+\sigma^{2})^{-\frac{n_{2}}{2}}e^{-\frac{\|Y\|^{2}}{2\sigma^{2}}}e^{-\frac{\|Z\|^{2}}{2(\theta^{2}+\sigma^{2})}}\pi(\sigma^{2}),$$

$$\widehat{\sigma}^{2}-\overline{Y^{2}}=\frac{n_{2}}{n_{1}}\bigg(\frac{\widehat{\sigma}^{2}}{\theta^{2}+\widehat{\sigma}^{2}}\bigg)^{2}\big[\overline{Z^{2}}-\theta^{2}-\widehat{\sigma}^{2}\big].$$

$$\begin{split}&\widehat{\sigma}^{2}-\sigma_{0}^{2}+O_{P}(n^{-1/2})\\ &=\frac{1-\alpha}{\alpha}\big(1+O(n^{-1})\big)\bigg(\frac{\widehat{\sigma}^{2}}{\theta^{2}+\widehat{\sigma}^{2}}\bigg)^{2}\bigg[\sigma_{0}^{2}-\widehat{\sigma}^{2}+\overline{\mu_{0}^{2}}+O_{P}(n^{-1/2})-\theta^{2}\bigg],\end{split}$$

$$\overline{\mu_{0}^{2}}=\|\mu_{0}\|^{2}/n_{2}$$

Full text

Bayesian variance estimation in the Gaussian sequence model with partial information on the means

Gianluca Finocchio and Johannes Schmidt-Hieber

University of Twente

Abstract

Consider the Gaussian sequence model under the additional assumption that a fixed fraction of the means is known. We study the problem of variance estimation from a frequentist Bayesian perspective. The maximum likelihood estimator (MLE) for $\sigma^{2}$ is biased and inconsistent. This raises the question whether the posterior is able to correct the MLE in this case. By developing a new proving strategy that uses refined properties of the posterior distribution, we find that the marginal posterior is inconsistent for any i.i.d. prior on the mean parameters. In particular, no assumption on the decay of the prior needs to be imposed. Surprisingly, we also find that consistency can be retained for a hierarchical prior based on Gaussian mixtures. In this case we also establish a limiting shape result and determine the limit distribution. In contrast to the classical Bernstein-von Mises theorem, the limit is non-Gaussian. We show that the Bayesian analysis leads to new statistical estimators outperforming the correctly calibrated MLE in a numerical simulation study.

Keywords: frequentist Bayes, maximum likelihood, semiparametric inference, Gaussian sequence model, Bernstein-von Mises theorems.

1 Introduction

For given $0\leq\alpha\leq 1,$ suppose we observe $n$ independent and normally distributed random variables

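$$X_{i}\sim\mathcal{N}\big(\mu_{i}^{0}\mathbf{1}(i>n\alpha),\sigma_{0}^{2}\big),\quad i=1,\ldots,n. \tag{1.1}$$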

The parameters in the model are $\mu_{i}^{0},$ $i>n\alpha$ and $\sigma_{0}>0.$ The goal is to estimate the variance $\sigma_{0}^{2}$ while treating the mean vector $\mu_{0}:=(\mu_{\lceil n\alpha\rceil}^{0},\dots,\mu_{n}^{0})$ as nuisance. For $\alpha=0,$ we recover the Gaussian sequence model. For $\alpha>0,$ this can be viewed as the Gaussian sequence model with additional knowledge that the means of the first $\lfloor n\alpha\rfloor$ observations are known (in which case we can subtract them from the data).

One can think of model (1.1) as a simple prototype of a combined dataset. For instance, when different measurement devices are used, one often faces merged datasets collected from multiple sources. The different sources might not be of the same quality concerning the underlying parameter, see [24] for an example. An alternative viewpoint is to interpret model (1.1) as a sparse sequence model with known support. Since a $(1-\alpha)$-fraction of the data is perturbed, we are in the dense regime. Knowledge of the support is then crucial, as otherwise there is no consistent estimator for $\sigma_{0}^{2}.$

If $n$ is even and $\alpha=1/2,$ then (1.1) is equivalent to the Neyman-Scott model [25] up to a reparametrization. Model (1.1) is in this case equivalent to observing $U_{i}:=(X_{n/2+i}+X_{i})$ and $V_{i}:=(X_{n/2+i}-X_{i})$ for $i=1,\ldots,n/2.$ Since $U_{i}$ and $V_{i}$ are independent, this is equivalent to observing independent random variables $U_{i},V_{i}\sim\mathcal{N}(\mu_{n/2+i}^{0},\widetilde{\sigma}_{0}^{2}),$ with $\widetilde{\sigma}_{0}^{2}=2\sigma_{0}^{2}.$ Estimation of $\widetilde{\sigma}_{0}^{2}$ in the latter model is known as the Neyman-Scott problem.
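The reparametrization can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `neyman_scott_pairs`, the constant mean value 1.5 and the sample size are illustrative assumptions, not taken from the paper. It simulates model (1.1) with $\alpha=1/2$, forms the pairs $(U_{i},V_{i})$ and compares their empirical variances with $2\sigma_{0}^{2}$.

```python
import random
import statistics

def neyman_scott_pairs(mu, sigma0, seed=0):
    """Simulate model (1.1) with alpha = 1/2 and return the pairs (U_i, V_i)."""
    rng = random.Random(seed)
    x_zero = [rng.gauss(0.0, sigma0) for _ in mu]   # X_1, ..., X_{n/2}: zero means
    x_mean = [rng.gauss(m, sigma0) for m in mu]     # X_{n/2+1}, ..., X_n: unknown means
    u = [b + a for a, b in zip(x_zero, x_mean)]     # U_i = X_{n/2+i} + X_i
    v = [b - a for a, b in zip(x_zero, x_mean)]     # V_i = X_{n/2+i} - X_i
    return u, v

mu = [1.5] * 20000      # hypothetical constant means, chosen only for illustration
sigma0 = 1.0
u, v = neyman_scott_pairs(mu, sigma0)

var_u = statistics.pvariance(u)     # should be close to 2 * sigma0**2
var_v = statistics.pvariance(v)     # should be close to 2 * sigma0**2
mean_u, mean_v = statistics.fmean(u), statistics.fmean(v)
cov_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / len(u)

# Within-pair estimator of the pair variance; it does not require the means,
# because U_i - V_i = 2 X_i has mean zero.
pair_var = sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * len(u))
print(var_u, var_v, cov_uv, pair_var)
```

The empirical covariance of $U_{i}$ and $V_{i}$ should be near zero, in line with their independence, and the within-pair estimator recovers $\widetilde{\sigma}_{0}^{2}=2\sigma_{0}^{2}$ because it only uses the zero-mean observations $X_{i}$.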

Although $\sigma_{0}^{2}$ can be estimated with parametric rate based on the first $n\alpha$ observations, a striking feature of the model is that the MLE for $\sigma_{0}^{2}$ is inconsistent. In fact, the MLE $\widehat{\sigma}^{2}_{\operatorname{mle}}$ converges to $\alpha\sigma_{0}^{2},$ therefore underestimating the true variance by the factor $\alpha.$ The reason is that the likelihood of the observations with non-zero mean significantly affects the total likelihood viewed as a function of $\sigma^{2}.$

We study what happens when a Bayesian approach is implemented for the estimation of the variance and whether the posterior distribution can correct for the bias of the MLE. The Bayesian method can be viewed as a weighted likelihood method: instead of taking the parameter with the largest likelihood, the posterior puts mass on parameter sets with large likelihood. Because of this, the posterior can in some cases correct the flaws of the MLE. Examples are irregular models, see [15, 11, 26].

In the first part of the paper, we prove that whenever the nuisances are independently generated from a proper distribution, the posterior does not contract around the true variance. This shows that, for a large class of natural priors, the Bayesian method is unable to correct the MLE. In frequentist Bayes, several lower bound techniques have been developed in order to describe when Bayesian methods do not work, [4, 8, 9, 29, 10, 19]. These results can be used for instance to show that a certain decay of the prior is necessary to ensure posterior contraction. Our lower bounds are of a different flavor and do not require a condition on the tail behavior.

Since for the non-zero means no additional structure is assumed, there is no way to say something about one mean from knowledge of all the other means. Therefore, one might be tempted to think that a correlated prior on the means cannot perform better than an i.i.d. prior and consequently must lead to an inconsistent posterior as well. Surprisingly, this is not true and we construct in the second part of the article a Gaussian mixture prior for which the posterior contracts with the parametric rate around the true variance. For this prior we derive the limit distribution in the Bernstein-von Mises sense. In contrast with the classical Bernstein-von Mises theorem, the posterior limit is non-Gaussian in the case of small means. In this case the posterior also incorporates information about the second part of the sample into the estimator and we show in a simulation study that the maximum a posteriori estimate based on the limit distribution outperforms the $\sqrt{n}$ -consistent estimator that only uses the observations with zero mean.

Estimation of the variance in model (1.1) can also be interpreted as a semiparametric problem. The results in this article therefore contribute to the recent efforts to understand frequentist Bayes in semiparametric models. Semiparametric Bernstein-von Mises theorems are derived under various conditions in [27, 5, 3, 7]. For specific priors, it has been observed that there can be a large bias in the posterior limit, see [6, 7, 26]. In all the cases studied so far, it is unclear whether the bias is due to the specific choice of prior or whether this is a fundamental limitation of the Bayesian method. To the best of our knowledge, our results show for the first time that the posterior can be inconsistent for all natural priors.

Related to model (1.1), [14] studies Bayes for variance estimation of the errors in the nonparametric regression model. It is shown that if the posterior contracts around the true regression function with rate $o(n^{-1/4}),$ the marginal posterior for the variance contracts with parametric rate around the true error variance and a Bernstein-von Mises result holds.

The article is organized as follows. In Section 2, we discuss aspects of the problem related to the likelihood and the posterior distribution. A crucial identity for the log-posterior is derived in Section 3. This leads then to the general negative result in Section 4. The Gaussian mixture prior with parametric posterior contraction is constructed in Section 5. This section also contains the limiting shape result and a numerical simulation study. All proofs are deferred to the appendix.

Notation: For a vector $u=(u_{1},\ldots,u_{k})$ , we write $\|u\|^{2}=\sum_{i=1}^{k}u_{i}^{2}$ and $\overline{u^{2}}=\|u\|^{2}/k$ for the averages of the squares (not to be confused with the squared averages). We write $n_{1}=\lfloor n\alpha\rfloor$ and $n_{2}=n-n_{1}$ . The probability and expectation induced by model (1.1) are denoted by $P_{0}^{n}$ and $E_{0}^{n}.$

2 Likelihood and posterior

The MLE. For the subsequent analysis, it is convenient to split the data vector $X=(X_{1},\ldots,X_{n})$ in the part with zero means $Y=(X_{1},\ldots,X_{n_{1}})$ and the observations with non-zero means $Z=(X_{n_{1}+1},\ldots,X_{n})$ such that $X=(Y,Z).$ The likelihood function of the model is

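$$\begin{split}L\big(\sigma^{2},\mu\big|Y,Z\big)&=\underbrace{\frac{1}{(2\pi\sigma^{2})^{n_{1}/2}}e^{-\frac{\|Y\|^{2}}{2\sigma^{2}}}}_{L(\sigma^{2},\mu|Y)}\underbrace{\frac{1}{(2\pi\sigma^{2})^{n_{2}/2}}e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}}_{L(\sigma^{2},\mu|Z)}\\ &=\frac{1}{(2\pi\sigma^{2})^{n/2}}e^{-\frac{\|Y\|^{2}+\|Z-\mu\|^{2}}{2\sigma^{2}}}.\end{split} \tag{2.1}$$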

Maximizing over $(\sigma^{2},\mu)$ yields the MLE

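$$\big(\widehat{\sigma}_{\operatorname{mle}}^{2},\widehat{\mu}_{\operatorname{mle}}\big)=\Big(\frac{\|Y\|^{2}}{n},Z\Big).$$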

If only based on the subsample $Y,$ the MLE for $\sigma_{0}^{2}$ would be $\|Y\|^{2}/n_{1}$ and this converges to $\sigma_{0}^{2}$ with the parametric rate $n^{-1/2}.$ Hence $\|Y\|^{2}/n$ converges to $\alpha\sigma_{0}^{2}.$ The MLE for $\sigma_{0}^{2}$ is therefore inconsistent and misses the true parameter $\sigma_{0}^{2}$ by a factor $\alpha.$ It is clear that there is very little extractable information about the parameter $\sigma_{0}^{2}$ in $Z.$ A frequentist estimator can simply discard $Z$ and only use the subsample $Y.$ The MLE also does this but leads to an incorrect scaling of the estimator.
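The incorrect scaling is easy to reproduce in a simulation. The sketch below uses only the Python standard library; the function name `variance_estimates`, the constant mean value and the sample size are illustrative assumptions. It computes the joint MLE $\|Y\|^{2}/n$ and the subsample estimator $\|Y\|^{2}/n_{1}$ on the same data.

```python
import math
import random

def variance_estimates(n, alpha, sigma0, mean_value, seed=0):
    """Return (joint MLE of sigma^2, estimator based on Y only) for model (1.1)."""
    rng = random.Random(seed)
    n1 = math.floor(n * alpha)
    y = [rng.gauss(0.0, sigma0) for _ in range(n1)]              # known (zero) means
    z = [rng.gauss(mean_value, sigma0) for _ in range(n - n1)]   # unknown means
    mu_hat = z                        # MLE of the nuisance means: fits Z exactly
    rss = sum(v * v for v in y) + sum((zi - mi) ** 2 for zi, mi in zip(z, mu_hat))
    sigma2_mle = rss / n              # equals ||Y||^2 / n, since Z contributes zero
    sigma2_y = sum(v * v for v in y) / n1
    return sigma2_mle, sigma2_y

alpha, sigma0 = 0.5, 1.0
# mean_value = 2.0 is a hypothetical choice; the MLE does not depend on it.
sigma2_mle, sigma2_y = variance_estimates(n=20000, alpha=alpha, sigma0=sigma0, mean_value=2.0)
print(sigma2_mle, alpha * sigma0 ** 2)   # MLE is close to alpha * sigma0^2
print(sigma2_y, sigma0 ** 2)             # Y-based estimator is close to sigma0^2
```

The residual sum of squares of $Z$ vanishes at $\widehat{\mu}_{\operatorname{mle}}=Z$, so the subsample $Z$ adds $n_{2}$ to the denominator without adding anything to the numerator.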

The incorrect scaling factor of the MLE can be explained in different ways. One interpretation is that the MLE can be written as

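$$\widehat{\sigma}_{\operatorname{mle}}^{2}=\frac{n_{1}}{n}\widehat{\sigma}_{Y,\operatorname{mle}}^{2}+\frac{n_{2}}{n}\widehat{\sigma}_{Z,\operatorname{mle}}^{2},$$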

with $\widehat{\sigma}_{Y,\operatorname{mle}}^{2}=\|Y\|^{2}/n_{1}$ the MLE based on the subsample $Y$ and $\widehat{\sigma}_{Z,\operatorname{mle}}^{2}=0$ the MLE based on the subsample $Z.$ The fact that the overall MLE just forms a linear combination of the MLEs for the subsamples shows again that too much weight is given to $Z.$

Another explanation for the incorrect scaling of the MLE is to observe that in (2.1) the likelihood based on the second subsample is $L(\sigma^{2},\mu|Z)\propto\sigma^{-n_{2}}$ if $\mu=\widehat{\mu}_{\operatorname{mle}}.$ If we took the likelihood only over the first part of the sample $Y,$ we would obtain the optimal estimator $\|Y\|^{2}/n_{1}.$ However, since the likelihood over the full sample is the product of the likelihood functions for $Y$ and $Z,$ an additional factor $\sigma^{-n_{2}}$ occurs in the overall likelihood, which leads to the incorrect scaling. More generally, we conjecture that likelihood methods do not perform well for combined datasets where one part of the data is informative about a parameter and the other part is affected by nuisance parameters.

Adjusted profile likelihood. For the profile likelihood, we first compute the maximum likelihood estimator of the nuisance parameter for fixed $\sigma^{2},$ denoted by, say $\widehat{\mu}_{\sigma^{2}},$ and then maximize

$$\sigma^{2}\mapsto L\big(\sigma^{2},\widehat{\mu}_{\sigma^{2}}\big|Y,Z\big).$$

Obviously $\widehat{\mu}_{\sigma^{2}}=Z$ for any $\sigma^{2}>0$ and the profile likelihood estimator coincides with the MLE for $\sigma^{2}$ in the Neyman-Scott problem. If the parameter of interest and the nuisance parameters are orthogonal with respect to the Fisher information, that is,

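$$E\Big[\frac{\partial^{2}}{\partial\sigma^{2}\partial\mu_{j}}\log L\big(\sigma^{2},\mu\big|Y,Z\big)\Big]=0,\quad\text{for all }j, \tag{2.3}$$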

the adjusted profile likelihood estimator [12, 23, 13] is the maximizer of

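$$\sigma^{2}\mapsto\mathcal{L}(\sigma^{2}):=\det\big(M(\sigma^{2},\widehat{\mu}_{\sigma^{2}})\big)^{-1/2}L\big(\sigma^{2},\widehat{\mu}_{\sigma^{2}}\big|Y,Z\big) \tag{2.4}$$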

for the matrix valued function

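$$M(\sigma^{2},\mu):=\Big(-\frac{\partial^{2}}{\partial\mu_{j}\partial\mu_{\ell}}\log L\big(\sigma^{2},\mu\big|Y,Z\big)\Big)_{j,\ell}$$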

and $\det()$ the determinant. It is easy to check that (2.3) holds for model (1.1). Since $-\partial^{2}/(\partial\mu_{j}\partial\mu_{\ell})\ \log L\big{(}\sigma^{2},\mu\big{|}Y,Z\big{)}=\sigma^{-2}\mathbf{1}(j=\ell),$ the adjusted profile likelihood estimator for $\sigma^{2}$ coincides with the MLE for the subsample $Y,$

$$\widehat{\sigma}^{2}=\frac{\|Y\|^{2}}{n_{1}}.$$

In particular, the adjusted profile likelihood results in an unbiased $\sqrt{n}$ -consistent estimator for $\sigma^{2}.$
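The effect of the determinant correction can be seen by maximizing both criteria on a grid. The following is a minimal sketch using only the Python standard library; the function names, the grid and the sample sizes are illustrative assumptions. It uses $\widehat{\mu}_{\sigma^{2}}=Z$ and $M=\sigma^{-2}I_{n_{2}}$, so that $\det(M)^{-1/2}=\sigma^{n_{2}}$ cancels the extra factor $\sigma^{-n_{2}}$ in the profile likelihood.

```python
import math
import random

def log_profile(sigma2, sum_sq_y, n):
    """Log profile likelihood: mu is replaced by its maximizer Z, so only Y contributes."""
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq_y / (2 * sigma2)

def log_adjusted_profile(sigma2, sum_sq_y, n, n2):
    """Adds log det(M)^(-1/2), where M = sigma^(-2) * identity of size n2."""
    log_det_m = -n2 * math.log(sigma2)
    return -0.5 * log_det_m + log_profile(sigma2, sum_sq_y, n)

rng = random.Random(1)
n, n1 = 2000, 1000                  # alpha = 1/2
n2 = n - n1
y = [rng.gauss(0.0, 1.0) for _ in range(n1)]     # true variance sigma0^2 = 1
sum_sq_y = sum(v * v for v in y)

grid = [0.01 * k for k in range(10, 301)]        # sigma^2 from 0.10 to 3.00
profile_max = max(grid, key=lambda s: log_profile(s, sum_sq_y, n))
adjusted_max = max(grid, key=lambda s: log_adjusted_profile(s, sum_sq_y, n, n2))

print(profile_max, sum_sq_y / n)     # profile maximizer tracks ||Y||^2 / n
print(adjusted_max, sum_sq_y / n1)   # adjusted maximizer tracks ||Y||^2 / n1
```

Up to the grid resolution, the profile maximizer agrees with $\|Y\|^{2}/n$ and the adjusted one with $\|Y\|^{2}/n_{1}$.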

The posterior distribution. From a Bayesian perspective it is quite natural to draw $\sigma^{2}$ and the mean vector $\mu$ from independent distributions. Due to the orthogonality with respect to the Fisher information (2.3), we expect no strong interactions of $\sigma^{2}$ and the mean parameters in the likelihood that could be taken care of by a dependent prior. Suppose that $\mu\sim\nu$ and that the prior for $\sigma^{2}$ has Lebesgue density $\pi.$ The marginal posterior distribution is then given by Bayes' formula

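$$\pi\big(\sigma^{2}\big|Y,Z\big)=\frac{L(\sigma^{2}|Y,Z)\pi(\sigma^{2})}{\int_{\mathbb{R}_{+}}L(\sigma^{2}|Y,Z)\pi(\sigma^{2})\,d\sigma^{2}},$$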

with

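$$L(\sigma^{2}|Y,Z)=\sigma^{-n}e^{-\frac{\|Y\|^{2}}{2\sigma^{2}}}\Big(\int_{\mathbb{R}^{n}}e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}d\nu(\mu)\Big). \tag{2.6}$$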

In [28] it has been argued that by using multivariate Laplace approximation,

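$$L(\sigma^{2}|Y,Z)=\mathcal{L}(\sigma^{2})\nu\big(\widehat{\mu}_{\sigma^{2}}\big)\big(1+O_{P}(n^{-1})\big)=\mathcal{L}(\sigma^{2})\nu\big(Z\big)\big(1+O_{P}(n^{-1})\big), \tag{2.7}$$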

with $\mathcal{L}(\sigma^{2})$ the adjusted profile likelihood in (2.4). This suggests that the posterior distribution should be centered around the adjusted profile likelihood estimator $\|Y\|^{2}/n_{1},$ therefore correcting the MLE.

Associated sequence model with random means. For the Gaussian sequence model with partial information (1.1) equipped with the product prior $\pi\otimes\nu,$ define the associated sequence model with random means, where we observe independent random variables

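$$Y_{i}\sim\mathcal{N}(0,\sigma_{0}^{2}),\ i=1,\dots,n_{1}\quad\text{and}\quad Z_{i}\mid\mu\sim\mathcal{N}(\mu_{i},\sigma_{0}^{2}),\ i=n_{1}+1,\dots,n, \tag{2.8}$$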

with $\mu\sim\nu$ and $\nu$ known. In this model, the nuisance parameters are replaced by additional randomness. The only parameter in this model is $\sigma_{0}^{2}$ and the model is therefore parametric.

Remark 2.1.

The likelihood function of model (2.8) is $L(\sigma^{2}|Y,Z).$ Model (1.1) and model (2.8) lead therefore to the same formula for the posterior distribution of $\sigma^{2}$ in terms of $Y,Z.$

Bayes with improper uniform prior. If the prior on the mean vector in the Bayes formula is chosen as the Lebesgue measure, the formula for the posterior simplifies to

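$$\pi\big(\sigma^{2}\big|Y,Z\big)\propto\sigma^{-n_{1}}e^{-\frac{\|Y\|^{2}}{2\sigma^{2}}}\pi(\sigma^{2}).$$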

This is the same posterior we would get if we discarded the subsample $Z.$ It follows from the parametric Bernstein-von Mises theorem that if $\pi$ is positive and continuous in a neighbourhood of $\sigma_{0}^{2},$ the posterior contracts around the true variance $\sigma_{0}^{2}.$ Notice that in the case of uniform prior, the Laplace approximation in (2.7) is exact and does not involve any remainder terms. Obviously the Lebesgue measure is not a probability measure and the prior is improper. This raises then the question whether there are also proper priors for which the marginal posterior is consistent on the whole parameter space. We will address this problem in the next sections.
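For a concrete picture of this posterior, one can sample from it directly. The sketch below uses only the Python standard library and makes one additional assumption that is not in the text: $\pi(\sigma^{2})$ is taken to be flat, so that the density $(\sigma^{2})^{-n_{1}/2}e^{-\|Y\|^{2}/(2\sigma^{2})}$ is inverse-gamma with shape $n_{1}/2-1$ and scale $\|Y\|^{2}/2$. The function name `sample_posterior_sigma2` and the value of $\sigma_{0}$ are hypothetical.

```python
import math
import random

def sample_posterior_sigma2(y, draws, seed=0):
    """Draw from the density proportional to (sigma^2)^(-n1/2) * exp(-||Y||^2 / (2 sigma^2)).

    Assumes a flat pi(sigma^2); requires n1 > 2 so that the shape is positive.
    """
    rng = random.Random(seed)
    shape = len(y) / 2 - 1
    scale = sum(v * v for v in y) / 2
    # If G ~ Gamma(shape, 1), then scale / G is inverse-gamma(shape, scale).
    return [scale / rng.gammavariate(shape, 1.0) for _ in range(draws)]

data_rng = random.Random(42)
sigma0 = 1.3                                      # hypothetical true standard deviation
y = [data_rng.gauss(0.0, sigma0) for _ in range(2000)]
n1 = len(y)
sum_sq_y = sum(v * v for v in y)

draws = sample_posterior_sigma2(y, draws=5000)
post_mean = sum(draws) / len(draws)
post_sd = math.sqrt(sum((d - post_mean) ** 2 for d in draws) / len(draws))

print(post_mean, sum_sq_y / (n1 - 4))   # Monte Carlo mean vs. exact inverse-gamma mean
print(post_sd)                          # spread of order n1^(-1/2)
```

The subsample $Z$ never enters the computation, which mirrors the statement that this posterior is the one obtained after discarding $Z$.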

3 On the derivative of the log-posterior

We first derive a differential equation for the posterior. Denote by $\mu|(Z,\sigma^{2})$ the posterior distribution of $\mu$ for the sample $Z,$ that is,

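$$d\Pi(\mu\mid Z,\sigma^{2})=\frac{e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}d\nu(\mu)}{\int_{\mathbb{R}^{n}}e^{-\frac{\|Z-\mu\|^{2}}{2\sigma^{2}}}d\nu(\mu)}.$$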

In particular, we set

$$V\big(\mu|(Z,\sigma^{2})\big):=\int_{\mathbb{R}^{n}}\|Z-\mu\|^{2}d\Pi(\mu|Z,\sigma^{2}). \tag{3.2}$$

The quantity $V(\mu|(Z,\sigma^{2}))$ measures the spread of $\Pi(\mu|Z,\sigma^{2})$ around the vector $Z.$ Recall moreover the definition of $L(\sigma^{2}|Y,Z)$ in (2.6).

Proposition 3.1.

The marginal posterior satisfies

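$$\partial_{\sigma^{2}}\log\frac{\pi(\sigma^{2}\mid Y,Z)}{\pi(\sigma^{2})}=\partial_{\sigma^{2}}\log L(\sigma^{2}\mid Y,Z)=\frac{\|Y\|^{2}+V(\mu\mid(Z,\sigma^{2}))}{2\sigma^{4}}-\frac{n}{2\sigma^{2}}. \tag{3.3}$$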

By Remark 2.1, the right-hand side is a closed-form expression of the score function for $\sigma^{2}$ in the random means model (2.8). If the MLE in (2.8) does not lie on the boundary, the score function vanishes at the MLE. From the Bernstein-von Mises phenomenon it is conceivable that the posterior will concentrate around this MLE. For the MLE to be close to the truth $\sigma_{0}^{2},$ the score function evaluated at $\sigma_{0}^{2}$ must be $o_{P}(1).$ Since $\|Y\|^{2}=n\alpha\sigma_{0}^{2}+O_{P}(\sqrt{n}),$ this leads to the condition

$$\frac{V(\mu\mid(Z,\sigma_{0}^{2}))}{n}=(1-\alpha)\sigma_{0}^{2}+o_{P}(1).$$

In the next section, we derive a very general negative result. The main part of the argument is to show that the previous equality does not hold in a neighborhood of $\sigma_{0}^{2},$ see (A.12).
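The identity of Proposition 3.1 can be verified numerically in a case where both sides are available in closed form. The sketch below uses only the Python standard library and assumes, for illustration only, the Gaussian product prior $\nu=\mathcal{N}(0,\theta^{2})^{\otimes n_{2}}$; the function names and all numerical values are hypothetical. Under this prior, $\mu_{i}\mid(Z_{i},\sigma^{2})$ is Gaussian with mean $Z_{i}\theta^{2}/(\theta^{2}+\sigma^{2})$ and variance $\theta^{2}\sigma^{2}/(\theta^{2}+\sigma^{2})$.

```python
import math
import random

def log_marginal_likelihood(sigma2, y, z, theta2):
    """log L(sigma^2 | Y, Z) as in (2.6), for the prior nu = N(0, theta2) on each mean."""
    n = len(y) + len(z)
    sum_sq_y = sum(v * v for v in y)
    # Per coordinate, the integral of exp(-(z - mu)^2 / (2 sigma2)) against N(0, theta2)
    # equals sqrt(sigma2 / (theta2 + sigma2)) * exp(-z^2 / (2 (theta2 + sigma2))).
    log_integral = sum(
        0.5 * math.log(sigma2 / (theta2 + sigma2)) - zi * zi / (2 * (theta2 + sigma2))
        for zi in z
    )
    return -0.5 * n * math.log(sigma2) - sum_sq_y / (2 * sigma2) + log_integral

def posterior_spread(sigma2, z, theta2):
    """V(mu | (Z, sigma^2)): posterior expectation of ||Z - mu||^2."""
    shrink = sigma2 / (theta2 + sigma2)             # Z_i - E[mu_i | Z_i] = shrink * Z_i
    post_var = theta2 * sigma2 / (theta2 + sigma2)  # Var(mu_i | Z_i)
    return sum((shrink * zi) ** 2 + post_var for zi in z)

def score_formula(sigma2, y, z, theta2):
    """Right-hand side of the identity in Proposition 3.1."""
    n = len(y) + len(z)
    sum_sq_y = sum(v * v for v in y)
    return (sum_sq_y + posterior_spread(sigma2, z, theta2)) / (2 * sigma2 ** 2) - n / (2 * sigma2)

rng = random.Random(5)
y = [rng.gauss(0.0, 1.0) for _ in range(100)]
z = [rng.gauss(2.0, 1.0) for _ in range(100)]       # hypothetical true means equal to 2
theta2, sigma2, h = 1.5, 0.9, 1e-5

numeric_score = (
    log_marginal_likelihood(sigma2 + h, y, z, theta2)
    - log_marginal_likelihood(sigma2 - h, y, z, theta2)
) / (2 * h)                                         # central finite difference
formula_score = score_formula(sigma2, y, z, theta2)
print(numeric_score, formula_score)
```

The two numbers agree up to the finite-difference error, which illustrates how the posterior spread $V$ enters the score.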

4 Posterior inconsistency for product priors

In this section we study posterior contraction under the following condition.

Prior. The prior on $\mu$ is independent of the prior on $\sigma^{2}.$ Under the prior, each component of the mean vector $\mu$ is drawn independently from a distribution $\nu$ on $\mathbb{R}.$ The prior on $\sigma^{2}$ has a positive and continuously differentiable Lebesgue density on $\mathbb{R}_{+}.$

So far, $\nu$ denoted the prior on the mean vector. By a slight abuse of language we denote the prior on the individual components also by $\nu.$ The assumptions on the prior are mild enough to account for proper priors with heavy tails and possibly no moments.

The i.i.d. prior is the natural choice, if we believe that there is no structure in the non-zero means. From (2.8) it follows that the corresponding sequence model with random means is

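$$Y_{i}\sim\mathcal{N}(0,\sigma_{0}^{2}),\ i=1,\dots,n_{1}\quad\text{and}\quad Z_{i}\mid\mu_{i}\sim\mathcal{N}(\mu_{i},\sigma_{0}^{2}),\ i=n_{1}+1,\dots,n, \tag{4.1}$$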

with $\mu_{i}\sim\nu.$ For $\alpha=1/2$ and unknown $\nu,$ this model has been studied in [21]. It is shown that the MLE for $\sigma_{0}^{2}$ and the MLE for the distribution function of the means are consistent. Since the random means model leads to the same posterior distribution as explained in Remark 2.1, this suggests that the posterior might concentrate around the truth.

We now provide a second heuristic that leads to a different conclusion indicating that it makes a huge difference whether the distribution of the means $\nu$ is known or unknown. In the framework of (4.1), $\nu$ is known. If $\int u^{2}d\nu(u)<\infty,$ then $\overline{\mu^{2}}=\int u^{2}d\nu(u)+O_{P}(n^{-1/2})$ and $\overline{Z^{2}}=\overline{\mu^{2}}+\sigma_{0}^{2}+O_{P}(n^{-1/2}),$ so we have $\overline{Z^{2}}-\int u^{2}d\nu(u)=\sigma_{0}^{2}+O_{P}(n^{-1/2}).$ This means that model (4.1) carries a lot of information about $\sigma_{0}^{2}$ in the sense that $\sigma_{0}^{2}$ can be estimated with parametric rate from the subsample $Z$ only. Since the posterior only sees model (4.1) it is therefore natural to give a lot of weight to the subsample $Z$ as well, which, from a frequentist perspective, is wrong.
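This heuristic can be illustrated with a short simulation. The sketch below uses only the Python standard library; the choice $\nu=\mathcal{N}(0,\theta^{2})$, the fixed mean value 3 and the function name `moment_estimate` are illustrative assumptions. It applies the moment estimator $\overline{Z^{2}}-\int u^{2}d\nu(u)$ once to data from the random means model (4.1) and once to data with fixed means as in model (1.1).

```python
import random

def moment_estimate(z, prior_second_moment):
    """Estimate sigma0^2 from Z alone as mean(Z^2) minus the second moment of nu."""
    return sum(zi * zi for zi in z) / len(z) - prior_second_moment

rng = random.Random(7)
n2, sigma0, theta = 50000, 1.0, 2.0     # nu = N(0, theta^2), second moment theta^2

# Random means model (4.1): each mean is drawn from nu before adding noise.
z_random = [rng.gauss(rng.gauss(0.0, theta), sigma0) for _ in range(n2)]

# Frequentist model (1.1): the means are fixed and need not resemble draws from nu.
fixed_mean = 3.0                        # hypothetical value, chosen for illustration
z_fixed = [rng.gauss(fixed_mean, sigma0) for _ in range(n2)]

est_random = moment_estimate(z_random, theta ** 2)
est_fixed = moment_estimate(z_fixed, theta ** 2)

# Population targets: sigma0^2 in the first case,
# sigma0^2 + fixed_mean^2 - theta^2 in the second.
print(est_random, sigma0 ** 2)
print(est_fixed, sigma0 ** 2 + fixed_mean ** 2 - theta ** 2)
```

When the means really come from $\nu$, the subsample $Z$ identifies $\sigma_{0}^{2}$. When they are fixed, the same estimator converges to a different value, which is the sense in which trusting $Z$ is wrong from a frequentist perspective.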

This heuristic does not say anything about heavy-tailed priors with $\int u^{2}d\nu(u)=\infty.$ But even in this case, we will show that the posterior is inconsistent. The first result states that in a neighborhood of $\sigma_{0}^{2}$ the posterior is increasing extremely fast with high probability.

Proposition 4.1.

Given $\alpha<1$ and the prior above, then, for all sufficiently large $\sigma_{0}^{2},$ there exists a mean vector $\mu_{0},$ such that

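$$\lim_{n\to\infty}P_{0}^{n}\Big(\Big\{\partial_{\sigma^{2}}\log\pi(\sigma^{2}|Y,Z)\geq\sigma_{0}^{-2}n,\ \forall\sigma^{2}\in\Big[\frac{\sigma_{0}^{2}}{2},2\sigma_{0}^{2}\Big]\Big\}\Big)=1.$$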

The proof of Proposition 4.1 constructs a lower bound on $\sigma_{0}^{2}$ that is independent of $n$ and moreover guarantees that $\nu$ has sufficiently small mass outside $[-\sigma_{0}^{2},\sigma_{0}^{2}].$ It therefore depends on the tail behavior of the prior mean distribution $\nu.$ The mean vector $\mu_{0}$ is subsequently chosen with all means being equal to an expression only depending on $\sigma_{0}^{2}.$ Thus the means in $\mu_{0}$ are uniformly bounded and independent of $n$ as well.

Suppose that almost all posterior mass is close to $\sigma_{0}^{2}.$ By the previous proposition, the posterior is increasing at least up to $2\sigma_{0}^{2}.$ Hence, there must be even more mass around $2\sigma_{0}^{2}.$ This is a contradiction and shows that the posterior does not concentrate around $\sigma_{0}^{2}.$ The proof of the next theorem is based on this argument. For this result, the means in the vector $\mu_{0}$ can again be chosen to be uniformly bounded.

Theorem 4.2.

Given $\alpha<1$ and the prior above, then, for all sufficiently large $\sigma_{0}^{2},$ there exists a mean vector $\mu_{0}$ such that

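$$\lim_{n\to\infty}E_{0}^{n}\Big[\Pi\Big(\Big|\frac{\sigma^{2}}{\sigma_{0}^{2}}-1\Big|\leq\frac{1}{2}\Big|Y,Z\Big)\Big]=0.$$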

Consequently, the posterior is inconsistent and assigns all its mass outside of a neighbourhood of the true variance.

The posterior is therefore inferior if compared to the frequentist variance estimator $\overline{Y^{2}},$ which achieves the parametric rate $n^{-1/2}$ in the sense that

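$$\sup_{\sigma_{0}^{2}>0}E_{0}^{n}\Big[\Big|\frac{\overline{Y^{2}}}{\sigma_{0}^{2}}-1\Big|\Big]\lesssim n^{-1/2}.$$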
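The parametric rate of $\overline{Y^{2}}$ can be checked by Monte Carlo. The following is a minimal sketch using only the Python standard library; the function name `scaled_abs_error`, the sample sizes, the number of repetitions and the value of $\sigma_{0}$ are illustrative assumptions. It estimates $\sqrt{n}\,E_{0}^{n}\big[|\overline{Y^{2}}/\sigma_{0}^{2}-1|\big]$ for several $n$; if the rate is $n^{-1/2}$, these scaled values should stay of the same order as $n$ grows.

```python
import math
import random

def scaled_abs_error(n, alpha, sigma0, reps, rng):
    """Monte Carlo estimate of sqrt(n) * E| mean(Y^2) / sigma0^2 - 1 |."""
    n1 = math.floor(n * alpha)
    total = 0.0
    for _ in range(reps):
        mean_sq = sum(rng.gauss(0.0, sigma0) ** 2 for _ in range(n1)) / n1
        total += abs(mean_sq / sigma0 ** 2 - 1.0)
    return math.sqrt(n) * total / reps

rng = random.Random(11)
scaled = {
    n: scaled_abs_error(n, alpha=0.5, sigma0=2.0, reps=500, rng=rng)
    for n in (50, 200, 800)
}
for n, value in scaled.items():
    print(n, round(value, 3))
```

The ratio $\overline{Y^{2}}/\sigma_{0}^{2}$ does not depend on $\sigma_{0}$ in distribution, so a single value of $\sigma_{0}$ suffices to illustrate the supremum over $\sigma_{0}^{2}>0$.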

It is remarkable that no conditions on the tail behavior of the prior distribution $\nu$ are required for Theorem 4.2. Recall that for the improper uniform prior the posterior contracts around $\sigma_{0}^{2}.$ This shows that for distributions with heavy tailed densities, very sharp bounds are required.

To the best of our knowledge there are no negative results in the nonparametric Bayes literature that hold for such a large class of priors. The proof strategy to establish Proposition 4.1 is based on a highly non-standard shrinkage argument that will be sketched here. By expanding the square term in (3.2) we can lower bound (3.3) by

[TABLE]

where $V_{i}:=|Z_{i}|\int|\mu_{i}|d\Pi(\mu|Z_{i},\sigma^{2}).$ For $\sigma^{2}$ close to $\sigma_{0}^{2}$ , we have

[TABLE]

For an improper uniform prior, one can check that $V_{i}\geq Z_{i}^{2}$ , making the lower bound negative and useless. For a proper prior, there is a shrinkage phenomenon in the sense that for any $c>0$ there are parameters $(\mu_{i}^{0})^{2}\asymp\sigma_{0}^{2}$ such that $V_{i}\leq cZ_{i}^{2}$ , with high $P_{0}^{n}-$ probability. If this is the case then

[TABLE]

which yields the conclusion by choosing $c>0$ small enough.

In Proposition 4.1 we showed that the posterior overshoots the true variance $\sigma_{0}^{2}$ whenever the true means are large enough. By analyzing the Gaussian case in the next section, we see that for small means the posterior will in fact underestimate $\sigma_{0}^{2}$ and that only for a small range of mean vectors, one can hope that the posterior will be able to concentrate around the true variance.

5 Gaussian mixture priors

5.1 Gaussian priors

To illustrate our approach, we first consider an i.i.d. Gaussian prior on the mean vector

[TABLE]

From Theorem 4.2 we already know that the posterior will be inconsistent in this case. Nevertheless, the Gaussian assumption yields more explicit formulas and this allows us to build a hierarchical prior resulting in good posterior contraction properties. By Remark 2.1, the marginal likelihood is the same as in the sequence model with random means (2.8). The marginal posterior is therefore

[TABLE]

which can also be written as the product of two inverse Gamma densities. In view of the Bernstein-von Mises phenomenon, the posterior concentrates around the MLE for parametric problems. Similarly, we can argue here that the posterior will be concentrated around the value $\widehat{\sigma}^{2}$ maximizing the likelihood part of the posterior (5.1). By differentiation, we find $n_{1}\widehat{\sigma}^{2}+n_{2}\widehat{\sigma}^{4}/(\widehat{\sigma}^{2}+\theta^{2})=\|Y\|^{2}+\widehat{\sigma}^{4}\|Z\|^{2}/(\theta^{2}+\widehat{\sigma}^{2})^{2}$ and rearranging yields

[TABLE]

This can be rewritten as

[TABLE]

where we set

[TABLE]

and suppress the dependence of the $O()$ term on $\sigma_{0}^{2}$ and $\mu_{0}.$ Since $\theta$ is fixed, this shows that for $\widehat{\sigma}^{2}=\sigma_{0}^{2}+O_{P}(n^{-1/2}),$ we need

[TABLE]

In other words, to force the maximum $\widehat{\sigma}^{2}$ to be close to $\sigma_{0}^{2},$ the variance $\theta^{2}$ of the prior has to match the empirical second moment $\overline{\mu_{0}^{2}}$ of the nuisance parameter. We can also deduce from (5.2) that if $|\overline{\mu_{0}^{2}}-\theta^{2}|\gg n^{-1/2}$ and $\theta$ is fixed, then also $|\widehat{\sigma}^{2}-\sigma_{0}^{2}|\gg n^{-1/2}.$ More precisely, we even have that $\overline{\mu_{0}^{2}}-\theta^{2}\gg n^{-1/2}$ implies $\widehat{\sigma}^{2}-\sigma_{0}^{2}\gg n^{-1/2}$ and $\overline{\mu_{0}^{2}}-\theta^{2}\ll-n^{-1/2}$ implies $\widehat{\sigma}^{2}-\sigma_{0}^{2}\ll-n^{-1/2}.$ This shows that, depending on the size of $\overline{\mu_{0}^{2}}$ compared to $\theta^{2},$ the posterior can either overestimate or underestimate the true variance.
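The direction of the bias can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the likelihood part of (5.1) has the Gaussian form implied by the stationarity equation above, replaces $\overline{Y^{2}}$ and $\overline{Z^{2}}$ by their expectations $\sigma_{0}^{2}$ and $\sigma_{0}^{2}+\overline{\mu_{0}^{2}}$, and locates the maximizer by a plain grid search. The function names and the grid parameters are illustrative choices.

```python
import math

def log_likelihood(sigma2, n1, n2, y2bar, z2bar, theta2):
    """Likelihood part of the marginal posterior for a fixed prior variance theta2.

    Assumed form: Y_i ~ N(0, sigma2) for the n1 known zero means and, after
    integrating out the N(0, theta2) prior, Z_i ~ N(0, sigma2 + theta2).
    """
    a = sigma2 + theta2
    return (-0.5 * n1 * (math.log(sigma2) + y2bar / sigma2)
            - 0.5 * n2 * (math.log(a) + z2bar / a))

def likelihood_maximizer(n1, n2, y2bar, z2bar, theta2, step=1e-3, upper=10.0):
    """Grid search for the maximizing sigma2 in (0, upper]."""
    grid = [k * step for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda s2: log_likelihood(s2, n1, n2, y2bar, z2bar, theta2))

if __name__ == "__main__":
    sigma0_sq, theta2, n1, n2 = 1.0, 1.0, 500, 500
    for mu2 in (0.25, 1.0, 2.0):
        # Idealized statistics: sample averages replaced by their expectations.
        s2_hat = likelihood_maximizer(n1, n2, sigma0_sq, sigma0_sq + mu2, theta2)
        print(f"mean square {mu2:4.2f}, theta^2 {theta2:4.2f}: maximizer {s2_hat:.3f}")
```

With these idealized inputs the maximizer lies below $\sigma_{0}^{2}$ when $\overline{\mu_{0}^{2}}<\theta^{2}$, at $\sigma_{0}^{2}$ when the two agree, and above $\sigma_{0}^{2}$ when $\overline{\mu_{0}^{2}}>\theta^{2}$.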

If $\theta$ is allowed to vary with $n$ , we can make the right hand side in (5.2) arbitrarily small by letting $\theta$ tend to infinity. As $\theta^{2}$ is the variance of the prior, the behavior resembles then that of the uniform improper prior, which, as we already know, leads to posterior consistency. If we think of a prior as a prior belief on the parameters, then the prior should not change depending on the amount of available data and, in particular, it is unnatural that the prior becomes more vague if the sample size increases. In the next section we show that there are sample size independent mixture priors leading to parametric posterior contraction rates.

5.2 Mixture priors

Section 4 explains the posterior inconsistency for an i.i.d. prior on the nuisance. At first sight, it seems counterintuitive that introducing dependence in the prior on the nuisance parameter can help to avoid posterior inconsistency for $\sigma_{0}^{2}.$ Surprisingly, it does. In this section, we first provide some intuition why mixture priors can resolve the issues of i.i.d. priors. Afterwards, we discuss and analyze a specific prior construction.

Analyzing Gaussian priors above, (5.3) suggests that for any nuisance parameter vector $\mu_{0},$ there exists an i.i.d. prior which seems to work. This i.i.d. prior does, however, depend on the unknown $\mu_{0}$ and can therefore not be chosen without knowledge of the data. Intuitively, if the posterior had the chance to see all possible i.i.d. priors on $\mu,$ instead of just one, it is conceivable that it would automatically select one that is adapted to the unknown nuisance parameter and consequently leads to posterior consistency for the parameter of interest. De Finetti’s theorem [18] states that an exchangeable prior $\nu$ over the infinite sequence $\mu=(\mu^{1},\mu^{2},\dots)$ can be written as a mixture over i.i.d. priors in the sense that

[TABLE]

with $\lambda$ a probability measure on the set of probability densities $\mathcal{P}(\mathbb{R})$ on $\mathbb{R}$ . Assuming interchangeability of the integrals, the posterior (2.5) then becomes

[TABLE]

where $q$ denotes the probability density function of $Q.$ Let $q_{0}$ be the i.i.d. prior maximizing the interior integral. Suppose that this is a unique maximum and that the outer integral is determined by the behavior of the integrand in a suitable neighborhood $\mathcal{S}$ of $q_{0}.$ This means that

[TABLE]

The right hand side is the posterior density of $\sigma^{2}$ for i.i.d. prior $\prod_{i=1}^{n}q_{0}(\mu^{i})$ on the components.

Although this argument is only a sketch, it suggests that something might be gained by mixing over i.i.d. priors instead of just choosing one. Maximizing the marginalized likelihood in (5.1) over $\theta^{2}$ yields

[TABLE]

if the r.h.s. is non-negative. For this choice of $\theta^{2},$ (5.1) becomes $\pi(\sigma^{2}|Y,Z)\propto\sigma^{-n_{1}}\exp(-\|Y\|^{2}/(2\sigma^{2}))\pi(\sigma^{2}).$ The posterior therefore coincides with the posterior density based on the first part of the sample only, which we know has good posterior contraction properties.
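The profiling step can be spelled out as a short worked computation. This is a sketch under the assumption that the likelihood part of (5.1) has the Gaussian product form used above; $L$ is a symbol introduced here for that likelihood and $\overline{Z^{2}}=\|Z\|^{2}/n_{2}$.

```latex
% Assumption: the likelihood part of (5.1) factorizes as in the first line
% (Y_i ~ N(0, sigma^2) and, marginally, Z_i ~ N(0, sigma^2 + theta^2)).
\begin{align*}
L(\sigma^{2},\theta^{2})
 &= \sigma^{-n_{1}} e^{-\|Y\|^{2}/(2\sigma^{2})}
    \,(\sigma^{2}+\theta^{2})^{-n_{2}/2} e^{-\|Z\|^{2}/(2(\sigma^{2}+\theta^{2}))}, \\
\partial_{\theta^{2}} \log L(\sigma^{2},\theta^{2})
 &= \frac{n_{2}}{2}\,\frac{\overline{Z^{2}}-\sigma^{2}-\theta^{2}}{(\sigma^{2}+\theta^{2})^{2}} = 0
 \quad\Longleftrightarrow\quad \theta^{2}=\overline{Z^{2}}-\sigma^{2}, \\
L(\sigma^{2},\overline{Z^{2}}-\sigma^{2})
 &= \sigma^{-n_{1}} e^{-\|Y\|^{2}/(2\sigma^{2})}\,
    \big(\overline{Z^{2}}\big)^{-n_{2}/2} e^{-n_{2}/2}.
\end{align*}
```

The last factor does not depend on $\sigma^{2}$, which is why only the $Y$-part of the likelihood survives after profiling.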

Prior. In a first step generate $\theta^{2}\sim\gamma$ , with $\gamma$ a positive Lebesgue density on $\mathbb{R}_{+}.$ Given $\theta^{2},$ each non-zero mean is drawn independently from a centered normal distribution with variance $\theta^{2},$ that is, $\mu_{i}|\theta^{2}\sim\mathcal{N}(0,\theta^{2}),$ $i>n_{1}.$

Another heuristic about the posterior properties for this prior can again be derived by making the link to the associated sequence model with random means (2.8). For the prior considered here, the random means model has the form

[TABLE]

with $\theta^{2}\sim\gamma.$ If $\theta^{2}$ were a second parameter and not generated from $\gamma,$ the variance $\sigma_{0}^{2}$ would not be identifiable if only the $Z_{i}$ ’s are observed. In model (5.5) we know the density $\gamma,$ but this is not enough to consistently reconstruct $\sigma_{0}^{2}$ from the subsample $Z.$ By Remark 2.1, this model leads to the same posterior for $\sigma^{2}.$ The posterior should therefore realize that there is little extractable information about $\sigma_{0}^{2}$ in $Z$ and discard these observations. We will see in the limiting shape result below that this is roughly what happens.
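The non-identifiability can be illustrated with a few lines of code. This is a minimal sketch assuming, as in the heuristic above, that conditionally on $\theta^{2}$ the observations $Z_{i}$ are i.i.d. centered normal with variance $\sigma^{2}+\theta^{2}$; the data vector and the function name are made up for illustration.

```python
import math

def log_likelihood_z(z, sigma2, theta2):
    """Gaussian log-likelihood of Z_1, ..., Z_n2 i.i.d. N(0, sigma2 + theta2)."""
    a = sigma2 + theta2  # only the sum enters the likelihood
    return sum(-0.5 * math.log(2.0 * math.pi * a) - zi * zi / (2.0 * a) for zi in z)

# Hypothetical subsample; any values would do.
z = [0.3, -1.7, 2.1, 0.4, -0.9]

# Two different parameter pairs with the same sum sigma2 + theta2 = 3.
print(log_likelihood_z(z, 1.0, 2.0))
print(log_likelihood_z(z, 2.5, 0.5))
```

Both calls return the same value, so the subsample $Z$ alone cannot separate $\sigma^{2}$ from $\theta^{2}$.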

We denote by $\ell(\sigma^{2}|Y)$ and $\ell(\sigma^{2}+\theta^{2}|Z)$ the log-likelihoods of the sub-samples $Y$ and $Z$ coming from model (5.5) with $\sigma^{2}$ replacing $\sigma_{0}^{2},$ that is

[TABLE]

The log-likelihoods appearing in (5.6) can be written in terms of inverse-gamma distributions. We denote by $\operatorname{IG}(\gamma,\beta)$ the inverse-gamma distribution with shape $\gamma>0$ and scale $\beta>0.$ The corresponding p.d.f. is

[TABLE]

where $\Gamma(\cdot)$ is the Gamma function. Rewriting the posterior, we have that

Lemma 5.1.

Under the Gaussian mixture prior, the marginal posterior density has the form

[TABLE]

with $\gamma_{1}=n_{1}/2-1,$ $\beta_{1}=n_{1}\overline{Y^{2}}/2$ and $\gamma_{2}=n_{2}/2-1,$ $\beta_{2}=n_{2}\overline{Z^{2}}/2.$ The $\operatorname{IG}(\gamma_{1},\beta_{1})-$ distribution has mode $\beta_{1}/(\gamma_{1}+1)=\overline{Y^{2}}$ and variance $\beta_{1}^{2}/((\gamma_{1}-1)^{2}(\gamma_{1}-2))=O(n^{-1}),$ whereas the $\operatorname{IG}(\gamma_{2},\beta_{2})-$ distribution has mode $\beta_{2}/(\gamma_{2}+1)=\overline{Z^{2}}$ and variance $\beta_{2}^{2}/((\gamma_{2}-1)^{2}(\gamma_{2}-2))=O(n^{-1}).$
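The mode and variance formulas are easy to evaluate. The sketch below uses the standard inverse-gamma expressions with the shape and scale stated above; the helper names are illustrative, and the comparison value $2(\overline{Y^{2}})^{2}/n_{1}$ is the usual Gaussian asymptotic variance, included only as a reference point.

```python
def inverse_gamma_mode(shape, scale):
    """Mode of IG(shape, scale): scale / (shape + 1)."""
    return scale / (shape + 1.0)

def inverse_gamma_variance(shape, scale):
    """Variance of IG(shape, scale), finite only for shape > 2."""
    if shape <= 2.0:
        raise ValueError("variance requires shape > 2")
    return scale ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))

def ig_summary(n_obs, mean_square):
    """Mode and variance for shape n/2 - 1 and scale n * mean_square / 2."""
    shape = n_obs / 2.0 - 1.0
    scale = n_obs * mean_square / 2.0
    return inverse_gamma_mode(shape, scale), inverse_gamma_variance(shape, scale)

if __name__ == "__main__":
    for n_obs in (100, 1000, 10000):
        mode, var = ig_summary(n_obs, mean_square=1.0)
        # Ratio to 2 * mean_square^2 / n approaches one as n grows.
        print(n_obs, mode, var, var / (2.0 / n_obs))
```

The mode equals the mean square exactly, and the variance decays at the rate $n^{-1}$.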

Starting from Lemma 5.1, we can develop a heuristic argument on how to recover the shape of the limit posterior distribution. We interpret the posterior $\Pi(\cdot|Y,Z)$ with density (5.8) as the marginalized version, over the set $\theta^{2}\in(0,+\infty),$ of the distribution $\widetilde{\Pi}(\cdot|Y,Z)$ whose density is given by

[TABLE]

and refer to $\widetilde{\Pi}(\cdot|Y,Z)$ as the joint posterior on $(\sigma^{2},\theta^{2})\in(0,+\infty)^{2}.$ The first step is double localization. Thanks to the exponential tails of the inverse Gamma distribution, the joint posterior $\widetilde{\Pi}(\cdot|Y,Z)$ asymptotically concentrates on the set $\{\sigma^{2}\in B_{1}\}\cap\{\theta^{2}\in B_{2}\},$ with $B_{1}$ a $O(\zeta_{n})$ -ball centered at $\overline{Y^{2}}$ and $B_{2}$ a $O(\zeta_{n})$ -ball around $0\vee(\overline{Z^{2}}-\overline{Y^{2}})$ for a sequence $\zeta_{n}\asymp\sqrt{\log n/n}.$ This also implies that the joint posterior (5.9) is arbitrarily close, in total variation distance, to the truncated posterior distribution with density $\widetilde{\pi}(\sigma^{2},\theta^{2}|Y,Z)\mathbf{1}(\{\sigma^{2}\in B_{1}\}\cap\{\theta^{2}\in B_{2}\}).$ In particular, this means that the hyperparameter $\theta^{2}$ concentrates on a neighborhood of the maximal value derived in (5.4).

Arguing as in the classical proof of the Bernstein-von Mises theorem, we can then show that the truncated posterior distribution will asymptotically not depend on the prior and prove that the posterior given by (5.8) behaves asymptotically like

[TABLE]

Using essentially a Laplace approximation, we show that the log-likelihoods $\ell(\sigma^{2}|Y)$ and $\ell(\sigma^{2}+\theta^{2}|Z)$ in (5.6) can be uniformly approximated by a second-order Taylor expansion around their maximizers $\overline{Y^{2}}$ and $\overline{Z^{2}}-\sigma^{2},$ and thus the localized posterior converges in total variation distance to a distribution with density

[TABLE]

whose factors are a truncated Gaussian density with mode $\overline{Y^{2}}$ and variance $2\sigma_{0}^{4}/n_{1}=O(n^{-1})$ and the integral of a truncated Gaussian density with mode $\overline{Z^{2}}-\sigma^{2}$ and variance $2(\sigma_{0}^{2}+\overline{\mu_{0}^{2}})^{2}/n_{2}=O(n^{-1}).$ By undoing the localization argument, we can show that the restriction to the sets $B_{1}$ and $B_{2}$ can be removed from (5.11) and the posterior given by (5.8) converges in total variation distance to the posterior limit distribution

[TABLE]

with $\Phi$ the c.d.f. of the standard normal distribution. Recall that $\overline{Z^{2}}\approx\sigma_{0}^{2}+\overline{\mu_{0}^{2}}.$ This suggests that the term involving $\Phi$ in the posterior limit distribution should asymptotically disappear if $\overline{\mu_{0}^{2}}\gg n^{-1/2}.$ The limit of the posterior should then be the truncated Gaussian

[TABLE]

with mode $\overline{Y^{2}}$ and variance $2\sigma_{0}^{4}/n_{1}=O(n^{-1}).$
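To see how quickly the $\Phi$ term becomes irrelevant, one can evaluate it at the centre of the Gaussian factor. The sketch below assumes that the $\Phi$ factor is the Gaussian integral over $\theta^{2}\in(0,\infty)$ described above, that is, $\Phi$ evaluated at $(\overline{Z^{2}}-\sigma^{2})$ divided by the standard deviation $\sqrt{2/n_{2}}\,(\sigma_{0}^{2}+\overline{\mu_{0}^{2}})$; the exact normalization of the limit density is not reproduced here, and the sample statistics are replaced by their expectations.

```python
import math

def std_normal_cdf(x):
    """Standard normal c.d.f. computed via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi_factor(sigma2, z2bar, sigma0_sq, mu2bar, n2):
    """Assumed Phi term: integral over theta^2 > 0 of a Gaussian density with
    mean z2bar - sigma2 and variance 2 * (sigma0_sq + mu2bar)^2 / n2,
    normalized so that the full integral equals one."""
    sd = math.sqrt(2.0 / n2) * (sigma0_sq + mu2bar)
    return std_normal_cdf((z2bar - sigma2) / sd)

if __name__ == "__main__":
    n2, sigma0_sq = 5000, 1.0
    for mu2bar in (0.0, n2 ** -0.5, 10.0 * n2 ** -0.5, 1.0):
        # Evaluate at sigma2 = sigma0_sq with z2bar replaced by its expectation.
        value = phi_factor(sigma0_sq, sigma0_sq + mu2bar, sigma0_sq, mu2bar, n2)
        print(f"mu2bar = {mu2bar:.4f}: Phi factor = {value:.6f}")
```

For $\overline{\mu_{0}^{2}}=0$ the factor equals $1/2$ at the centre and genuinely reshapes the density, while for $\overline{\mu_{0}^{2}}$ well above $n^{-1/2}$ it is numerically indistinguishable from one.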

The next result is a formal statement of the arguments mentioned above. To pass to (5.13) involves an additional $\log n$ -factor in the signal strength of $\overline{\mu_{0}^{2}}.$ Denote by $\|\cdot\|_{\operatorname{TV}}$ the total variation distance and recall that the expectation $E_{0}^{n}$ is taken with respect to model (1.1).

Theorem 5.2.

Let $\Pi_{\infty}(\cdot|Y,Z)$ and $\widetilde{\Pi}_{\infty}(\cdot|Y)$ be the distributions corresponding to the densities (5.12) and (5.13), respectively. If the prior densities $\gamma,\pi:[0,\infty)\rightarrow(0,\infty)$ are positive and uniformly continuous, then, for any compact sets $K\subset(0,\infty),K^{\prime}\subset(-\infty,\infty),$ and $n\rightarrow\infty,$

[TABLE]

Moreover, if $\inf_{\mu_{i}^{0}\in K^{\prime},\forall i}|\mu_{i}^{0}|\gg(\log n/n)^{1/4},$ then

[TABLE]

As a corollary of the proof, posterior contraction around the true variance $\sigma_{0}^{2}$ with contraction rate $O(\sqrt{\log n/n})$ can be established. In the case of large means this is an immediate consequence of the posterior limit $\widetilde{\Pi}_{\infty}(\cdot|Y)$ and the parametric Bernstein-von Mises theorem. For small means it is less obvious because of the non-standard limit of the posterior.

Corollary 5.3.

There exists a constant $M=M(\alpha),$ such that

[TABLE]

The posterior limit distribution is closely related to the class of skew normal distributions, see [1, 2]. We now derive an alternative characterization of the limit distribution. From the argumentation above, the p.d.f.

[TABLE]

can be viewed as the joint posterior limit of $(\sigma^{2},\theta^{2}).$ In particular, the posterior limit distribution is the marginal distribution with respect to $\sigma^{2}.$ As this is clear from the context, we do not write explicitly that the following distributions are conditional on $Y,Z,$ that is, $Y,Z$ are assumed to be fixed.

Lemma 5.4.

Let

[TABLE]

be independent. The distribution with p.d.f. (5.14) coincides with the distribution of

[TABLE]

In particular, the posterior limit distribution $\Pi_{\infty}(\cdot|Y,Z)$ coincides with the distribution of

[TABLE]

If the standard deviations of $\eta,\xi$ are small compared to the means, the posterior limit distribution essentially compares the means $\overline{Y^{2}}$ and $\overline{Z^{2}}.$ This behavior is very reasonable because if $\overline{\mu_{0}^{2}}$ is small, $\overline{Y^{2}}\approx\overline{Z^{2}}$ and the subsample $Z$ becomes informative about $\sigma^{2}.$

The posterior limit depends on unknown quantities. A frequentist estimator mimicking the posterior would be to estimate $\sigma^{2}$ from the MLE for zero means $\overline{X^{2}}$ in the case that the means are small. To detect whether small means are present, we can check whether $\overline{Y^{2}}\geq\overline{Z^{2}},$ which leads then to the estimator

[TABLE]
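The verbal description of this estimator translates directly into code. The following is a minimal sketch based only on the two sentences above: use the pooled mean square $\overline{X^{2}}$ when $\overline{Y^{2}}\geq\overline{Z^{2}}$ and fall back to $\overline{Y^{2}}$ otherwise. The function names are hypothetical.

```python
def mean_square(x):
    """Average of the squared entries of x."""
    return sum(v * v for v in x) / len(x)

def plug_in_variance_estimator(y, z):
    """Sketch of the frequentist estimator mimicking the posterior limit.

    y: observations with known zero means, z: observations with unknown means.
    If the Z-subsample shows no excess energy, pool both subsamples;
    otherwise use the Y-subsample only.
    """
    y2bar, z2bar = mean_square(y), mean_square(z)
    if y2bar >= z2bar:
        return mean_square(list(y) + list(z))  # pooled MLE for zero means
    return y2bar

if __name__ == "__main__":
    print(plug_in_variance_estimator([1.0, -1.0], [2.0, 2.0]))  # Z has larger energy
    print(plug_in_variance_estimator([2.0, 2.0], [1.0, 1.0]))   # pooled estimate
```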

5.3 Finite sample analysis

We compare the estimators $\widehat{\sigma}_{Y}^{2}=\overline{Y^{2}}$ and $\widetilde{\sigma}^{2}$ to the maximum $\widehat{\sigma}_{\operatorname{map},\infty}^{2}$ and the mean $\widehat{\sigma}_{\operatorname{mean},\infty}^{2}$ of the limit density $\sigma^{2}\mapsto\pi_{\infty}(\sigma^{2}|Y,Z)$ for sample sizes $n\in\{10,100,1000\}.$ As discussed above, we expect to see some differences for small means. We study the performances for $\sigma_{0}^{2}=1$ and $\mu_{0}$ the vector with all entries equal to $t/n^{1/4}$ for the values $t\in\{0,1,2,5\}.$ Since $\widehat{\sigma}_{Y}^{2}$ does not depend on the means, the estimator performs equally well in all setups. Table 1 reports the average of the squared errors and the corresponding standard errors based on $10{,}000$ repetitions. The rescaled MLE $\widehat{\sigma}_{Y}^{2}$ performs worse than any of the other estimators for small signals. Among the other estimators there is no clear ’winner’. For $t=5,$ the risk of all estimators is nearly the same. For larger values of $t,$ our simulation experiments did not show any changes compared to $t=5$ and the results are therefore omitted from the table.

There has been a long-standing debate whether Bayesian methods perform well if interpreted as frequentist methods. Results like the complete class theorem and the Bernstein-von Mises theorem have been foundational in this regard, see [22, 16]. Our theory highlights another instance where Bayes leads to new estimators with good finite sample properties. The analysis moreover shows that the construction of a prior resulting in a posterior with good frequentist properties can be highly non-intuitive.

Appendix A Proofs

A.1 Proofs for Section 3

Proof of Proposition 3.1.

By direct computation,

[TABLE]

Since

[TABLE]

we recover (3.3). ∎

A.2 Proofs for Section 4

Proof of Proposition 4.1.

It is enough to show that the following statements hold for sufficiently large sample size $n.$ Let $Q(u)=\nu([-u,u]^{c})/\nu([-u,u]).$ Since $\nu$ is a probability distribution, $Q(u)\rightarrow 0$ for $u\rightarrow\infty.$ We work on $I=[\sigma_{0}^{2}/2,2\sigma_{0}^{2}],$ where $\sigma_{0}^{2}$ is chosen such that

[TABLE]

and $\alpha$ denotes the fraction of known zero means in the model. Notice that

[TABLE]

Let

[TABLE]

We choose the non-zero means to be

[TABLE]

Since the interval $I$ is compact and the prior $\pi$ is continuous and positive on $\mathbb{R}_{+},$ we have $\inf_{\sigma^{2}\in I}\pi(\sigma^{2})>0.$ Since we also assumed that $\pi^{\prime}$ is continuous, we find that

[TABLE]

for all sufficiently large $n.$ With (3.3) and (A.2),

[TABLE]

Using (3.1) and (3.2), we expand $V(\mu|Z,\sigma^{2}),$

[TABLE]

Since the integrands in the latter display are positive for $|\mu_{i}|\geq 2|Z_{i}|,$ we can set $V_{i}:=|Z_{i}|\int_{|\mu|\leq 2|Z_{i}|}|\mu|\pi(\mu|Z_{i},\sigma^{2})d\mu$ and bound

[TABLE]

As a next step in the proof, we show

[TABLE]

To prove this inequality, we distinguish the cases $|Z_{i}|>R$ and $|Z_{i}|\leq R,$ decomposing

[TABLE]

with

[TABLE]

For the term $A_{i}$ of (A.8), observe that $A_{i}\leq 2|Z_{i}|\mathbf{1}{(|Z_{i}|>R)}.$ If $|Z_{i}|>R,$ $|Z_{i}|\leq 2|Z_{i}|-R\leq 2|Z_{i}-R/2|$ and therefore,

[TABLE]

Next, we bound the term $B_{i}$ in (A.8). In the sequel, we frequently make use of the fact that $\sigma^{2}\in I.$ The idea is to split the domain of integration $0\leq|\mu|\leq 2|Z_{i}|$ into sets $|\mu|\leq\sigma_{0}$ and $\sigma_{0}<|\mu|\leq 2|Z_{i}|.$ The contribution of the first part can be bounded by $\sigma_{0}.$ More work is needed for the second part. By expanding the square $(\mu-Z_{i})^{2}$ in the exponent, the $Z_{i}^{2}$ -terms in the numerator and denominator cancel against each other, as they do not depend on $\mu,$ and we have

[TABLE]

We now treat numerator and denominator separately. For the numerator, the function $y\mapsto ye^{-y^{2}/2}$ attains its maximum at $y=1$ and is bounded by $e^{-1/2}.$ This means that $|\mu|e^{-\frac{\mu^{2}}{2\sigma^{2}}}\leq\sigma e^{-1/2}\leq\sigma_{0},$ where the last step follows from (A.2). Together with (A.2), we obtain

[TABLE]

using $\mu Z_{i}/\sigma^{2}\leq 4R^{2}/\sigma_{0}^{2}$ to bound the exponent in the integral. To derive a lower bound of the denominator, we replace the integral over $\mathbb{R}$ by an integral over $[-\sigma_{0},\sigma_{0}].$ On this interval, $e^{-\mu^{2}/(2\sigma^{2})}\geq e^{-1}$ and $\mathbf{1}(|Z_{i}|\leq R)e^{\frac{\mu Z_{i}}{\sigma^{2}}}\geq e^{-R^{2}/\sigma^{2}}\geq e^{-2R^{2}/\sigma_{0}^{2}},$ since $\sigma_{0}\leq R.$ We obtain

[TABLE]

Combining this with the upper bound for the numerator yields, with (A.1), (A.3) and the definition of the function $Q(u),$

[TABLE]

Together with (A.9) and (A.7),

[TABLE]

With $|Z_{i}|\sigma_{0}e\leq Z_{i}^{2}/4+\sigma_{0}^{2}e^{2},$ we finally obtain (A.6).

In a final step of the proof, we derive, on an event with large probability, a deterministic lower bound for the right hand side in (A.6). Let $U_{1},\ldots,U_{n_{2}}$ be independent random variables. Rewriting Chebyshev’s inequality yields $P(n^{-1}\sum_{i=1}^{n_{2}}U_{i}>n^{-1}\sum_{i=1}^{n_{2}}(E[U_{i}]-\sigma_{0}^{2}))\geq 1-\sum_{i=1}^{n_{2}}\operatorname{Var}(U_{i})/(n_{2}\sigma_{0}^{2})^{2}.$ We apply this with $U_{i}=Z_{i}^{2}/2-16(Z_{i}-R/2)^{2}.$ Recall that $Z_{i}\sim\mathcal{N}(R/2,\sigma_{0}^{2}).$ Therefore, $E_{0}[Z_{i}^{2}]=R^{2}/4+\sigma_{0}^{2}$ and $E[(Z_{i}-R/2)^{2}]=\sigma_{0}^{2}.$ For the variance, $\operatorname{Var}_{0}(Z_{i}^{2})=R^{2}\sigma_{0}^{2}+2\sigma_{0}^{4}$ and $\operatorname{Var}((Z_{i}-R/2)^{2})=2\sigma_{0}^{4}.$ Since by assumption $\alpha<1,$ Chebyshev’s inequality yields then $P_{0}^{n}(\mathcal{A}_{n})\to 1$ when $n\to\infty$ for the set

[TABLE]

On $\mathcal{A}_{n},$ we have using (A.3), (A.6) and $Q(\sigma_{0})\leq\exp(-48(17+2e^{2}+24/(1-\alpha))),$

[TABLE]

The assertion follows with (A.5). ∎

Proof of Theorem 4.2.

Proposition 4.1 shows that

[TABLE]

has $P_{0}^{n}$ -probability tending to one. This means that for $\sigma^{2},\widetilde{\sigma}^{2}\in[\sigma_{0}^{2}/2,2\sigma_{0}^{2}],$ with $\sigma^{2}\leq\widetilde{\sigma}^{2},$ we must have $\log\pi(\sigma^{2}|Y,Z)\leq\log\pi(\widetilde{\sigma}^{2}|Y,Z)-n(\widetilde{\sigma}^{2}-\sigma^{2})/\sigma_{0}^{2}.$ Exponentiating this inequality for $\widetilde{\sigma}^{2}=\sigma^{2}+\sigma_{0}^{2}/2,$ yields

[TABLE]

and this completes the proof since $|\sigma^{2}/\sigma_{0}^{2}-1|\leq 1/2$ is equivalent to $\sigma^{2}\in[\sigma_{0}^{2}/2,3\sigma_{0}^{2}/2].$ ∎
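The exponentiation and integration steps in the proof above can be written out as follows. This is a sketch that uses only the inequality for the log-densities stated in the proof; the constants in the displayed bound of the paper may differ.

```latex
% Assumption: the log-density inequality from the proof holds with
% \widetilde{\sigma}^{2} = \sigma^{2} + \sigma_{0}^{2}/2 for all
% \sigma^{2} \in [\sigma_{0}^{2}/2, 3\sigma_{0}^{2}/2].
\begin{align*}
\pi(\sigma^{2}|Y,Z)
 &\leq e^{-n/2}\,\pi\big(\sigma^{2}+\sigma_{0}^{2}/2\,\big|\,Y,Z\big),
 \qquad \sigma^{2}\in[\sigma_{0}^{2}/2,\,3\sigma_{0}^{2}/2], \\
\Pi\big(|\sigma^{2}/\sigma_{0}^{2}-1|\leq 1/2 \,\big|\, Y,Z\big)
 &= \int_{\sigma_{0}^{2}/2}^{3\sigma_{0}^{2}/2}\pi(\sigma^{2}|Y,Z)\,d\sigma^{2}
 \leq e^{-n/2}\int_{\sigma_{0}^{2}}^{2\sigma_{0}^{2}}\pi(u|Y,Z)\,du
 \leq e^{-n/2}.
\end{align*}
```

The substitution $u=\sigma^{2}+\sigma_{0}^{2}/2$ maps $[\sigma_{0}^{2}/2,3\sigma_{0}^{2}/2]$ onto $[\sigma_{0}^{2},2\sigma_{0}^{2}]$, and the last step uses that a probability density integrates to at most one.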

A.3 Proofs for Section 5

Proof of Lemma 5.1.

We can write the posterior as

[TABLE]

By using (5.6) and (5.7) we obtain (5.8). ∎

We now prepare for the proof of the limiting shape result. From (5.8), the density (5.9) of the joint posterior is

[TABLE]

With

[TABLE]

define

[TABLE]

It is shown below that the posterior concentrates on $\{\sigma^{2}\in B_{1}\}$ and $\{\theta^{2}\in B_{2}\}.$ The posterior can consequently be approximated by the distribution $\Pi_{1}(\cdot|Y,Z)$ defined through its density (5.10). On the localized set $(\sigma^{2},\theta^{2})\in B_{1}\times B_{2},$ we are able to replace the log-likelihoods by a quadratic expansion. This then allows us to approximate the posterior by $\Pi_{2}(\cdot|Y,Z)$ which is defined as the distribution with density (5.11). We now state the individual steps formally and provide the proofs.

Proposition A.1.

If the prior densities $\gamma,\pi:[0,\infty)\rightarrow(0,\infty)$ are positive and uniformly continuous, then there exists a sequence of sets $(A_{n})_{n}$ such that for any compact sets $K\subset(0,\infty),K^{\prime}\subset(-\infty,\infty),$

(i)

$\lim_{n\to\infty}\sup_{\sigma_{0}^{2}\in K,\mu_{i}^{0}\in K^{\prime},\forall i}P_{0}^{n}(A_{n}^{c})=0.$

(ii)

With $B_{1},B_{2}$ as defined in (A.15), we have for $n\rightarrow\infty,$

[TABLE]

(iii)

For $n\rightarrow\infty,$

[TABLE]

(iv)

For $n\rightarrow\infty,$

[TABLE]

(v)

For $n\rightarrow\infty,$

[TABLE]

(vi)

For $n\rightarrow\infty,$ and $\inf_{\mu_{i}^{0}\in K^{\prime}}|\mu_{i}^{0}|\gg(\log n/n)^{1/4}$ ,

[TABLE]

Proof of Proposition A.1.

Recall the definition of $\zeta_{n}$ in (A.14) and set

[TABLE]

Let $\underline{\sigma}_{0}^{2}=\inf\{\sigma_{0}^{2}\in K\}>0.$ Define the event

[TABLE]

Since $\delta_{n}\leq 1/2,$ this implies in particular that on $A_{n},$ $\overline{Y^{2}}\wedge\overline{Z^{2}}\geq\underline{\sigma}_{0}^{2}/2.$

Proof of (i): We simplify the notation by introducing the events

[TABLE]

so that $A_{n}=B_{n}\cap D_{n}$ . Thus $P_{0}^{n}(A_{n}^{c})\leq P_{0}^{n}(B_{n}^{c})+P_{0}^{n}(D_{n}^{c}).$ We show that both $P_{0}^{n}(B_{n}^{c})$ and $P_{0}^{n}(D_{n}^{c})$ tend to zero uniformly over compact sets of parameters. By Chebyshev’s inequality,

[TABLE]

Since

[TABLE]

we find

[TABLE]

with $H:=\sup_{\sigma_{0}^{2}\in K,\mu_{i}^{0}\in K^{\prime},\forall i}(\mu_{i}^{0})^{2}/\sigma_{0}^{2}.$ Notice that $H$ is a finite constant since $K\subset(0,\infty)$ and $K^{\prime}$ are compact sets. Because $\delta_{n}=O(\sqrt{\log n/n}),$ the previous probability tends to zero as $n$ increases. We now bound $P_{0}^{n}(B_{n}^{c})$ . Rewriting $B_{n}^{c},$ we obtain

[TABLE]

and again by Chebyshev’s inequality

[TABLE]

which again tends to zero for $n\to\infty$ uniformly over $\sigma_{0}^{2}\in K,\mu_{i}^{0}\in K^{\prime},\forall i.$

Proof of (ii): We work on the event $A_{n}$ defined in (A.17) deriving deterministic lower and upper bounds for the denominator and numerator in the Bayes formula. We start with

[TABLE]

and show that on the event $A_{n}$ this quantity tends to $0$ when $n$ tends to infinity. The first part of the proof provides a lower bound for the denominator. For that, we restrict $\sigma^{2}\in\Sigma:=[\overline{Y^{2}}/(1+\delta_{n}),\overline{Y^{2}}/(1+\delta_{n}/2)]$ and $\theta^{2}\in\Theta(\sigma^{2}):=[\overline{Z^{2}}-\sigma^{2},\overline{Z^{2}}(1+\delta_{n})-\sigma^{2}]\subset(0,\infty),$ where the last inclusion follows since by definition of the event $A_{n}$ in (A.17), $\overline{Z^{2}}-\sigma^{2}\geq\overline{Z^{2}}-\overline{Y^{2}}/(1+\delta_{n}/2)\geq 0$ . The inner integral in the denominator of (A.18) can be lower bounded by

[TABLE]

Thanks to the definition of $A_{n}$ in (A.17) and $\delta_{n}\leq 1,$ we have $\overline{Z^{2}}\leq\overline{\mu_{0}^{2}}+\sigma_{0}^{2}(1+\delta_{n}),$ so that $\overline{Z^{2}}(1+\delta_{n})\leq 2\overline{\mu_{0}^{2}}+4\sigma_{0}^{2}.$ We then set

[TABLE]

Since $K,K^{\prime}$ are compact sets and $\gamma$ is continuous and positive, we must have $\underline{\gamma}>0.$ Differentiating (5.6) gives $\partial_{\theta^{2}}\ell(\sigma^{2}+\theta^{2}|Z)=\tfrac{1}{2}n_{2}(\overline{Z^{2}}-\sigma^{2}-\theta^{2})/(\sigma^{2}+\theta^{2})^{2},$ so the function $\theta^{2}\mapsto\ell(\sigma^{2}+\theta^{2}|Z)$ is decreasing on $\Theta(\sigma^{2})$ for any $\sigma^{2}.$ As a direct consequence of (5.6), we obtain

[TABLE]

Consequently, for any $\sigma^{2}\in\Sigma,$

[TABLE]

where the last inequality follows since $\overline{Z^{2}}\geq\sigma_{0}^{2}/2$ on $A_{n},$ $\delta_{n}\leq 1,$ and $-\log(1+\delta_{n})\geq-\delta_{n}$ for $\delta_{n}\leq 1.$ The right hand side does not depend on $\sigma^{2}$ anymore. To lower bound the first integral in the denominator of (A.18) we apply a similar argument. By (5.6), $\partial_{\sigma^{2}}\ell(\sigma^{2}|Y)=n_{1}(\overline{Y^{2}}-\sigma^{2})/(2\sigma^{4}).$ This means that the function $\sigma^{2}\mapsto\ell(\sigma^{2}|Y)$ is increasing on $\Sigma$ and (5.6) yields

[TABLE]

On $A_{n},$ $\overline{Y^{2}}\leq\sigma_{0}^{2}(1+\delta_{n})$ and therefore $\overline{Y^{2}}/(1+\delta_{n}/2)\leq 2\sigma_{0}^{2}.$ Set

[TABLE]

so that $\underline{\pi}>0$ because $K$ is a compact set and $\pi$ is continuous and positive. We bound

[TABLE]

using that on $A_{n},$ $\overline{Y^{2}}\geq\sigma_{0}^{2}/2$ and $\log(1+\delta_{n})\geq\delta_{n}-\delta_{n}^{2}/8$ for $0\leq\delta_{n}\leq 1.$ The product of the lower bounds obtained in (A.20) and (A.21) is then a lower bound for the denominator of (A.18).

In the next step we upper bound the numerator of (A.18). Firstly, observe that $\ell(\sigma^{2}+\theta^{2}|Z)\leq\ell(\overline{Z^{2}}|Z)$ and

[TABLE]

Secondly, since $\sigma^{2}\mapsto\ell(\sigma^{2}|Y)$ is increasing on $(0,\overline{Y^{2}}]$ and decreasing on $[\overline{Y^{2}},\infty),$

[TABLE]

The numerator of (A.18) is upper bounded by the product of the bounds obtained in (A.22) and (A.23). Together with the bounds on the denominator in (A.20) and (A.21), and $\zeta_{n}=C\delta_{n},$ we derive, on the event $A_{n},$ the following bound for (A.18):

[TABLE]

The convergence to zero follows since by definition of the constant $C$ in (A.16), $n_{1}C^{2}-4n_{2}-n_{1}>4n_{1}$ and because of $\delta_{n}=O(\sqrt{\log n/n}).$

Along similar lines, we show now that, on the event $A_{n},$ $\widetilde{\Pi}(\theta^{2}\notin B_{2}|Y,Z)\to 0$ as $n$ tends to infinity. Since $\{\theta^{2}\notin B_{2}\}\subset\{\sigma^{2}\notin B_{1}\}\cup(\{\sigma^{2}\in B_{1}\}\cap\{\theta^{2}\notin B_{2}\}),$ and $\widetilde{\Pi}(\sigma^{2}\notin B_{1}|Y,Z)$ tends to zero by (A.24), it is sufficient to establish convergence of

[TABLE]

to zero. We can argue similarly as for the upper bound above using that $\ell(\sigma^{2}|Y)\leq\ell(\overline{Y^{2}}|Y).$ By following the same steps as for (A.22) and (A.23) and using that $a\mapsto\ell(a|Z)$ is increasing on $(0,\overline{Z^{2}}]$ and decreasing on $[\overline{Z^{2}},\infty),$ the numerator in (5.9) integrated over the set $\{\sigma^{2}\in B_{1}\}\cap\{\theta^{2}\notin B_{2}\}$ is upper bounded by

[TABLE]

Together with the lower bounds for the denominator in (A.20) and (A.21), we upper bound (A.25), on the event $A_{n},$ by

[TABLE]

By definition (see (A.16)), the constant $C^{2}>0$ satisfies $n_{2}C^{2}-4n_{2}-n_{1}>4n_{2}.$ Since $\delta_{n}=O(\sqrt{\log n/n}),$ this implies that the right hand side of (A.26) is bounded above by $\lesssim n\exp(-n_{2}\delta_{n}^{2}/4)\to 0,$ as $n\to\infty.$ Together with (A.24), this completes the proof for part (ii).

Proof of (iii): It is well-known that for probability measures $P,Q$ defined on the same measurable space $\mathcal{X},$

[TABLE]

see Lemma E.1 in [26]. With $A=B_{1}\cap B_{2},$ $P=\widetilde{\Pi}(\cdot|Y,Z)$ and $\Pi_{0}(\cdot|Y,Z)$ the distribution with density

[TABLE]

we have that

[TABLE]

By bounding the $L^{1}$ -distance between the densities, we now show that $\Pi_{0}(\sigma^{2}\in\cdot|Y,Z)$ and $\Pi_{1}(\sigma^{2}\in\cdot|Y,Z)$ are close in total variation using the following lemma.

Lemma A.2 (Lemma E.3 in [26]).

If $h(\sigma^{2})\propto d\Pi_{0}(\sigma^{2}\in\cdot|Y,Z)/d\Pi_{1}(\sigma^{2}\in\cdot|Y,Z)$ exists and $\int|h(\sigma^{2})-1|d\Pi_{1}(\sigma^{2}|Y,Z)\leq\delta$ for some $\delta\in(0,1),$ then also

[TABLE]

As $h$ is the Radon-Nikodym derivative up to a multiplicative factor, we can choose

[TABLE]

Then,

[TABLE]

Using the argument above, it remains to prove that $\sup_{\sigma^{2}\in B_{1}}|h(\sigma^{2})-1|\rightarrow 0$ for $n\rightarrow\infty.$ By the definition of $A_{n}$ and due to $\delta_{n}\leq\zeta_{n},$

[TABLE]

Recall that $K$ is a compact set. Since $\pi$ is positive and uniformly continuous,

[TABLE]

Similarly, we have on the event $A_{n},$

[TABLE]

Since $\mu_{i}^{0}\in K^{\prime}$ for all $i,$ the average of the squares $\overline{\mu_{0}^{2}}$ lies in the convex hull of $\{u^{2}:u\in K^{\prime}\}$ and

[TABLE]

For real numbers $u,v,$ $uv=(u-1)(v-1)+(u-1)+(v-1)+1.$ We therefore obtain with (A.28) and (A.30), $\sup_{\sigma^{2}\in B_{1}}|h(\sigma^{2})-1|\rightarrow 0$ for $n\rightarrow\infty.$ This completes the proof of $(iii).$

Proof of (iv): We use the same strategy as in the proof of part $(iii),$ applying Lemma A.2 to

[TABLE]

which is a constant multiple of the likelihood ratio of $\Pi_{1}(\sigma^{2}\in\cdot|Y,Z)$ and $\Pi_{2}(\sigma^{2}\in\cdot|Y,Z).$ To verify the assumptions of Lemma A.2, we have to show that $\sup_{\sigma^{2}\in B_{1}}|h(\sigma^{2})-1|\rightarrow 0$ for $n\rightarrow\infty.$ Using again the identity $uv=(u-1)(v-1)+(u-1)+(v-1)+1$ and the fact that $|\int f/\int g-1|\leq\sup|f/g-1|,$ we find that it is enough to prove that on the event $A_{n},$

[TABLE]

To verify (A.32), differentiating (5.6) gives

[TABLE]

and by a third-order Taylor expansion around the maximum $\overline{Y^{2}},$

[TABLE]

for some $s^{2}$ between $\sigma^{2}$ and $\overline{Y^{2}}.$ We now control the smaller order terms uniformly over $\sigma^{2}\in B_{1}.$ Observe that also $\overline{Y^{2}},s^{2}\in B_{1}.$ With (A.29), $\sup_{\sigma^{2},\widetilde{\sigma}^{2}\in B_{1}}|\sigma^{2}-\widetilde{\sigma}^{2}|=O(\zeta_{n})$ and $\sigma_{0}^{2}/2\leq\sigma^{2}\leq 2\sigma_{0}^{2}$ for all $\sigma^{2}\in B_{1}.$ Moreover, since $K\subset(0,\infty)$ is compact, $\inf\{\sigma_{0}^{2}\in K\}>0.$ Together this shows that

[TABLE]

establishing (A.32). To prove (A.33) we argue similarly. Differentiating (5.6) gives

[TABLE]

and by a third-order Taylor expansion around the maximum $\theta_{*}^{2}=\overline{Z^{2}}-\sigma^{2}$ ,

[TABLE]

for some $s^{2}$ between $\theta^{2}$ and $\overline{Z^{2}}-\sigma^{2}.$ If $(\sigma^{2},\theta^{2})\in B_{1}\times B_{2},$ then, on $A_{n},$ both $\overline{Z^{2}}-\sigma^{2}$ and $s^{2}$ are in $B_{2}^{\prime}.$ With (A.29) and (A.31), we have $\sup_{u,v\in B_{2}^{\prime}}|u-v|=O(\zeta_{n})$ and $(\sigma_{0}^{2}+\overline{\mu_{0}^{2}})/2\leq\sigma^{2}+s^{2}\leq 2(\sigma_{0}^{2}+\overline{\mu_{0}^{2}})$ for sufficiently large $n.$ Together with the reasoning for (A.32), this leads to

[TABLE]

being bounded by $\lesssim n\zeta_{n}^{3}$ and thus converging to zero.
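The Taylor-expansion argument behind (A.32) can be checked numerically. The sketch below assumes that the $Y$-part of the log-likelihood in (5.6) has the standard centred Gaussian form $-\tfrac{n}{2}(\log\sigma^{2}+\overline{Y^{2}}/\sigma^{2})$ up to constants. Since (5.6) is not reproduced in this section, this form, the function names, the value of $\overline{Y^{2}}$ and the choice $\zeta_{n}=\sqrt{\log n/n}$ are all assumptions made for illustration.

```python
import math


def log_lik(x, n, m):
    """Assumed Gaussian log-likelihood in x = sigma^2 (up to constants).

    m plays the role of the empirical second moment of the Y_i.
    """
    return -0.5 * n * (math.log(x) + m / x)


def quadratic_term(x, n, m):
    """Second-order Taylor term around the maximiser x = m."""
    return -n * (x - m) ** 2 / (4.0 * m * m)


def taylor_remainder(x, n, m):
    """What is left of log_lik(x) - log_lik(m) after the quadratic term."""
    return log_lik(x, n, m) - log_lik(m, n, m) - quadratic_term(x, n, m)


m = 1.3  # hypothetical value of the empirical second moment
for n in (10**2, 10**4, 10**6):
    zeta = math.sqrt(math.log(n) / n)  # illustrative localisation radius
    x = m * (1.0 + zeta)               # a point at the edge of the window
    print(n, taylor_remainder(x, n, m), n * zeta**3)
```

Under these assumptions the remainder is of order $n\zeta_{n}^{3}$, which tends to zero, so on the localised set the log-likelihood is close to its quadratic (Gaussian) approximation.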

Proof of (v): Define $\Pi_{3}(\cdot|Y,Z)$ as the distribution on $(0,\infty)^{2},$ with density (5.14), that is,

[TABLE]

and $\widetilde{\Pi}_{3}(\cdot|Y,Z)$ as the localization of $\Pi_{3}(\cdot|Y,Z)$ on $B_{1}\times B_{2},$ that is, the distribution with density

[TABLE]

Here $B_{1},B_{2}$ are as defined in (A.15). The marginal distributions of $\widetilde{\Pi}_{3}(\cdot|Y,Z)$ and $\Pi_{3}(\cdot|Y,Z)$ with respect to $\sigma^{2}$ are $\Pi_{2}(\cdot|Y,Z)$ and $\Pi_{\infty}(\cdot|Y,Z),$ respectively. Applying (A.27) yields

[TABLE]

To prove $(v),$ it remains to show that for $n\rightarrow\infty,$

[TABLE]

By Lemma 5.4, it is enough to prove that on $A_{n},$

[TABLE]

for independent $\xi\sim\mathcal{N}(\overline{Y^{2}},2\sigma_{0}^{4}/n_{1}),\eta\sim\mathcal{N}(\overline{Z^{2}},2(\sigma_{0}^{2}+\overline{\mu_{0}^{2}})^{2}/n_{2}).$ Recall that this and all the following statements in (v) should be understood conditionally on $Y,Z.$

To bound these terms, we rely heavily on the exponential bounds for Gaussian tail probabilities given by Mills' ratio [17]

[TABLE]
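For orientation, the textbook form of the Mills' ratio bounds states that $\frac{x}{1+x^{2}}\phi(x)\leq 1-\Phi(x)\leq\phi(x)/x$ for $x>0,$ where $\phi$ and $\Phi$ denote the standard normal density and distribution function. The display (A.37) may be stated in a slightly different form. The following minimal sketch, with hypothetical function names, checks the textbook version numerically.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_upper_tail(x):
    """P(N(0,1) > x), computed from the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_bounds(x):
    """Lower and upper bound on P(N(0,1) > x) for x > 0."""
    if x <= 0:
        raise ValueError("the bounds are only stated for x > 0")
    phi = normal_pdf(x)
    return x / (1.0 + x * x) * phi, phi / x


for x in (0.5, 1.0, 2.0, 4.0):
    lower, upper = mills_bounds(x)
    print(x, lower, normal_upper_tail(x), upper)
```

Both bounds decay like $e^{-x^{2}/2},$ which is the exponential decay exploited in the tail estimates of this proof.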

As a first step, we derive a lower bound on $P(0\leq\xi\leq\eta).$ Using that on $A_{n},$ $\overline{Y^{2}}/(1+\delta_{n}/2)\leq\overline{Z^{2}}=E[\eta],$ the definition of $\xi,$ the symmetry properties of the $\mathcal{N}(0,1)$ distribution, $\sigma_{0}^{2}/2\leq\overline{Y^{2}}\leq 2\sigma_{0}^{2}$ on $A_{n},$ and Mills' ratio (A.37), we find

[TABLE]

where in the last inequality we used that $x^{2}/(1+x^{2})>\frac{1}{2}$ for $x>1$ .

We now derive an upper bound for $P(\xi\notin B_{1}).$ Using the definition of $\xi,$ $\zeta_{n}\leq 1,$ $\overline{Y^{2}}\geq\sigma_{0}^{2}/2,$ and Mills' ratio (A.37),

[TABLE]

Next, we obtain a similar bound for $P(\eta-\xi\notin B_{2},\xi\leq\eta,\xi\in B_{1}).$ If we define the difference of two sets $U,V$ as $U-V:=\{u-v:u\in U,v\in V\},$ then $B_{2}=([\overline{Z^{2}}/(1+\zeta_{n}),\overline{Z^{2}}/(1-\zeta_{n})]-B_{1})\cap\mathbb{R}_{+}.$ On the event $\xi\leq\eta,\xi\in B_{1},$ we have that $\eta\in[\overline{Z^{2}}/(1+\zeta_{n}),\overline{Z^{2}}/(1-\zeta_{n})]$ implies $\eta-\xi\in B_{2}.$ Equivalently, $\eta-\xi\notin B_{2}$ implies $\eta\notin[\overline{Z^{2}}/(1+\zeta_{n}),\overline{Z^{2}}/(1-\zeta_{n})].$ On $A_{n},$ $|\overline{Z^{2}}-\overline{\mu_{0}^{2}}-\sigma_{0}^{2}|\leq\sigma_{0}^{2}\delta_{n}$ by definition. Since $\delta_{n}\leq 1/2,$ we obtain $\overline{Z^{2}}\geq(\overline{\mu_{0}^{2}}+\sigma_{0}^{2})/2.$ Together with the symmetry properties of the normal distribution, $\zeta_{n}\leq 1,$ and Mills' ratio (A.37), this yields

[TABLE]
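The set difference $U-V$ used above is easy to compute when both sets are closed intervals. The sketch below uses hypothetical helper names and made-up endpoints, since the definition (A.15) of $B_{1}$ is not reproduced in this section. It illustrates how a set of the form $(U-V)\cap\mathbb{R}_{+}$ is obtained and why $\eta\in U$ and $\xi\in V$ force $\eta-\xi\in U-V.$

```python
def interval_difference(u, v):
    """Difference {a - b : a in u, b in v} of two closed intervals."""
    (u_lo, u_hi), (v_lo, v_hi) = u, v
    return (u_lo - v_hi, u_hi - v_lo)


def clip_to_nonnegative(interval):
    """Intersect a closed interval with [0, infinity); None if empty."""
    lo, hi = interval
    if hi < 0:
        return None
    return (max(lo, 0.0), hi)


def contains(interval, x):
    """Membership test for a closed interval."""
    lo, hi = interval
    return lo <= x <= hi


# Made-up endpoints standing in for the interval around the Z-moment and B_1.
u = (2.0, 3.0)
v = (0.5, 1.0)
b2 = clip_to_nonnegative(interval_difference(u, v))
print(b2)
print(contains(b2, 2.4 - 0.7))  # eta = 2.4 in u, xi = 0.7 in v
```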

To prove (A.36), we bound

[TABLE]

and

[TABLE]

Now (A.36), and therefore (A.35), follows from the inequalities (A.38), (A.39), (A.40) and the definition of $\delta_{n}.$ This completes the proof of $(v).$

Proof of (vi): Recall the definitions of the densities

[TABLE]

and let

[TABLE]

be their localised versions on $B_{1}$ . It is enough to show that, on $A_{n},$

[TABLE]

For (A.41), we apply (A.27) and the fact that $\Pi_{\infty}(\cdot|Y,Z)$ is the marginal distribution of $\Pi_{3}(\cdot|Y,Z),$ finding

[TABLE]

In $(v)$ we proved that the right hand side converges to zero uniformly over $\sigma_{0}^{2}\in K,\mu_{i}^{0}\in K^{\prime},\forall i.$ For (A.42), we argue similarly, using that

[TABLE]

with $\xi\sim\mathcal{N}(\overline{Y^{2}},2\sigma_{0}^{4}/n_{1}).$ Using (A.39), we see that the right hand side converges to zero, uniformly over $\sigma_{0}^{2}\in K,\mu_{i}^{0}\in K^{\prime},\forall i.$

For (A.43), we apply Lemma A.2. On $A_{n},$ the likelihood ratio of $\Pi_{\infty,B_{1}}(\cdot|Y,Z)$ and $\widetilde{\Pi}_{\infty,B_{1}}(\cdot|Y)$ is given by

[TABLE]

On $A_{n},$

[TABLE]

Uniformly over $\sigma_{0}^{2}\in K$ and $\inf_{\mu_{i}^{0}\in K^{\prime}}|\mu_{i}^{0}|^{2}\gg\zeta_{n},$ the right hand side can be further upper bounded by $-\overline{\mu_{0}^{2}}/2$ for sufficiently large $n.$ Thus,

[TABLE]

Since $n\overline{\mu_{0}^{2}}\gg n\zeta_{n}\to\infty$ for $n\to\infty,$

[TABLE]

This concludes the proof of (vi). ∎

Proof of Theorem 5.2.

We insert $1=\mathbf{1}((Y,Z)\in A_{n})+\mathbf{1}((Y,Z)\notin A_{n})$ in the expectation. Since the total variation distance of probability measures is bounded, the result follows from Proposition A.1. ∎
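Written out, with $D_{n}$ denoting the total variation distance appearing in the theorem (our shorthand) and using the normalisation $0\leq D_{n}\leq 1,$ the splitting reads

$$E[D_{n}]=E\big[D_{n}\mathbf{1}((Y,Z)\in A_{n})\big]+E\big[D_{n}\mathbf{1}((Y,Z)\notin A_{n})\big]\leq E\big[D_{n}\mathbf{1}((Y,Z)\in A_{n})\big]+P\big((Y,Z)\notin A_{n}\big).$$

Under the convention in which the total variation distance is bounded by two, the last term picks up a factor of two, which does not affect the argument.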

Proof of Corollary 5.3.

Recall that the posterior is the marginal distribution of $\widetilde{\Pi}(\cdot|Y,Z)$ with respect to $\sigma^{2}.$ By Proposition A.1 (ii), we have that

[TABLE]

Using that on $A_{n},$ $\sigma_{0}^{2}(1-\delta_{n})\leq\overline{Y^{2}}\leq\sigma_{0}^{2}(1+\delta_{n}),$ and $\delta_{n}=C^{-1}\zeta_{n}=O(\sqrt{\log n/n}),$ we obtain

[TABLE]

for a constant $M=M(\alpha)$ that is chosen to be sufficiently large. The claim follows by splitting the expected posterior, inserting $1=\mathbf{1}((Y,Z)\in A_{n})+\mathbf{1}((Y,Z)\notin A_{n})$ in the expectation and using Proposition A.1 (i). ∎

Proof of Lemma 5.4.

To prove the result, we derive an expression for the joint density of $(\xi,\eta-\xi)\big{|}(0\leq\xi\leq\eta).$ Observe that

[TABLE]

The right hand side is zero if $s\leq 0.$ Suppose now that $0\leq s\leq t.$ Conditioning on $\eta,$ the right hand side can be rewritten as

[TABLE]

Taking derivatives $\partial_{s}\partial_{t},$ the density of $(\xi,\eta-\xi)\big{|}(0\leq\xi\leq\eta)$ at the point $(s,t)$ equals, up to a multiplicative constant, $f_{\xi}(s)f_{\eta}(t+s),$ which completes the proof for the case $0\leq s\leq t.$

The case $0\leq t\leq s$ is similar, and its proof is therefore omitted.

Since the posterior limit distribution is the marginal over the first component of the joint distribution in (5.14), it must coincide with the distribution of $\xi\big{|}(0\leq\xi\leq\eta).$ ∎
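The limit distribution $\xi\big{|}(0\leq\xi\leq\eta)$ can be explored by simulation. The following is a minimal rejection-sampling sketch. The function name, the seeds and the numerical values of the means and standard deviations are illustrative assumptions and are not taken from the paper's simulation study.

```python
import random
import statistics


def sample_limit(mean_xi, sd_xi, mean_eta, sd_eta, size, seed=0):
    """Draw from xi | (0 <= xi <= eta) for independent normal xi and eta.

    Plain rejection sampling: draw (xi, eta) and keep xi when the
    constraint holds.
    """
    rng = random.Random(seed)
    draws = []
    while len(draws) < size:
        xi = rng.gauss(mean_xi, sd_xi)
        eta = rng.gauss(mean_eta, sd_eta)
        if 0.0 <= xi <= eta:
            draws.append(xi)
    return draws


# Case 1: both means coincide, so the constraint xi <= eta binds often
# and pushes the conditional law of xi to the left.
tied = sample_limit(1.0, 0.1, 1.0, 0.1, size=20000, seed=1)

# Case 2: the mean of eta is well above that of xi, so the constraint is
# almost never active and the conditional law is close to that of xi.
separated = sample_limit(1.0, 0.1, 2.0, 0.2, size=20000, seed=2)

print(statistics.mean(tied), statistics.mean(separated))
```

In the first case the conditional mean falls visibly below the unconditional mean of $\xi,$ whereas in the second case the truncation is negligible and the draws behave like an untruncated normal sample.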

Acknowledgment

We are grateful to an anonymous referee and the AE for many helpful suggestions resulting in a major improvement of the article. The research has been supported by an NWO TOP grant.

Bibliography


[1] Azzalini, A. (1985) A class of distributions which includes the normal ones. Scand. J. Statist., 12(2), 171–178. MR808153
[2] Azzalini, A. and Dalla Valle, A. (1996) The multivariate skew-normal distribution. Biometrika, 83(4), 715–726. MR1440039
[3] Bickel, P. J. and Kleijn, B. J. K. (2012) The semiparametric Bernstein–von Mises theorem. Ann. Statist., 40(1), 206–237. MR3013185
[4] Castillo, I. (2008) Lower bounds for posterior rates with Gaussian process priors. Electron. J. Stat., 2, 1281–1299. MR2471287
[5] Castillo, I. (2012) A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probab. Theory Related Fields, 152, 53–99. MR2875753
[6] Castillo, I. (2012) Semiparametric Bernstein–von Mises theorem and bias, illustrated with Gaussian process priors. Sankhya A, 74(2), 194–221. MR3021557
[7] Castillo, I. and Rousseau, J. (2015) A Bernstein–von Mises theorem for smooth functionals in semiparametric models. Ann. Statist., 43(6), 2353–2383. MR3405597
[8] Castillo, I. and van der Vaart, A. (2012) Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. Ann. Statist., 40(4), 2069–2101. MR3059077
