Multi-Goal Prior Selection: A Way to Reconcile Bayesian and Classical Approaches for Random Effects Models

Masayo Y. Hirose; Partha Lahiri

arXiv:1901.08245·stat.ME·January 25, 2019


TL;DR

This paper introduces a multi-goal prior for Bayesian hyperparameter selection in random effects models, aligning Bayesian and classical solutions and establishing analytical equivalence of posterior variances with bootstrap estimates.

Contribution

It proposes a novel multi-goal prior that reconciles Bayesian and classical approaches in hierarchical models, with theoretical and practical implications.

Findings

01 Multi-goal prior produces Bayesian solutions matching classical methods.

02 Analytical equivalence between posterior variances and bootstrap MSE estimates.

03 Enhanced understanding of hyperparameter selection in random effects models.


Equations

$$\hat{A}_{i;G}=\arg\max_{0\leq A<\infty}h_{i;G}(A)L_{RE}(A),$$

$$\hat{A}_{i;G}-\hat{A}_{RE}=\frac{2\tilde{l}_{i;G}^{(1)}(A)}{tr[V^{-2}]}+o_{p}(m^{-1}),$$

$$\hat{b}_{1}=\frac{\partial B_{i}}{\partial A}\Big|_{\hat{A}_{RE}},\ \hat{b}_{2}=\frac{\partial^{2}B_{i}}{\partial A^{2}}\Big|_{\hat{A}_{RE}},\ \hat{\rho}_{1}=\frac{\partial\log\pi(A)}{\partial A}\Big|_{\hat{A}_{RE}},$$

$$\hat{h}_{2}=-\frac{1}{m}\frac{\partial^{2}l_{RE}}{\partial A^{2}}\Big|_{\hat{A}_{RE}}=\frac{tr[V^{-2}]}{2m}+o_{p}(m^{-1}),$$

$$\hat{h}_{3}=-\frac{1}{m}\frac{\partial^{3}l_{RE}}{\partial A^{3}}\Big|_{\hat{A}_{RE}}=-\frac{2tr[V^{-3}]}{m}+o_{p}(m^{-1}),$$

$$\pi_{i;G}(A)\propto(A+D_{i})\,tr(V^{-2})\,h_{i;G}(A),$$

$$\text{(i)}\ \hat{B}_{i}^{GHB}=\hat{B}_{i}(\hat{A}_{i;G})+o_{p}(m^{-1});$$

$$\text{(ii)}\ \hat{V}_{i}^{GHB}=V[B_{i}|y]=Var(\hat{B}_{i}(\hat{A}_{i;G}))+o_{p}(m^{-1});$$

$$\text{(iii)}\ \hat{\theta}_{i}^{GHB}=\hat{\theta}_{i}(\hat{A}_{i;G})+o_{p}(m^{-1}).$$

$$h_{i;G}(A)=o(A^{(m-p)/2}),$$

$$\hat{A}_{i;MG}=\arg\max_{0<A<\infty}\ \ldots$$

$$\hat{B}_{i;MG}=\hat{B}_{i}(\hat{A}_{i;MG}),$$

$$\hat{M}_{i;MG}=\hat{M}_{i}(\hat{A}_{i;MG}),$$

$$\text{(i)}\ \hat{A}_{i;MG}-\hat{A}_{RE}=O_{p}(m^{-1});$$

$$\text{(ii)}\ x_{i}^{\prime}\hat{\beta}(\hat{A}_{1;MG},\ldots,\hat{A}_{m;MG})-x_{i}^{\prime}\hat{\beta}(\hat{A}_{RE})=o_{p}(m^{-1}).$$

$$\hat{B}_{i}(\hat{A}_{i;MG})-\hat{B}_{i}(\hat{A}_{RE})=\{E[\hat{A}_{i;MG}-A]-E[\hat{A}_{RE}-A]\}b_{1}+o_{p}(m^{-1})=-\frac{2D_{i}}{tr[V^{-2}](A+D_{i})^{3}}+o_{p}(m^{-1}).$$

$$E[B_{i}|y]=\hat{B}_{i}(\hat{A}_{i;MG})+\frac{4D_{i}}{tr[V^{-2}](A+D_{i})^{2}}\left[\frac{1}{A+D_{i}}-\frac{tr[V^{-3}]}{tr[V^{-2}]}\right]+o_{p}(m^{-1}).$$

$$E[B_{i}|y]=\hat{B}_{i}(\hat{A}_{RE})+\frac{1}{2m\hat{h}_{2}}\left(\hat{b}_{2}-\frac{\hat{h}_{3}}{\hat{h}_{2}}\hat{b}_{1}\right)+\frac{\hat{b}_{1}}{m\hat{h}_{2}}\hat{\rho}_{1}+o_{p}(m^{-1}).$$

$$\frac{1}{2mh_{2}}\left(b_{2}-\frac{h_{3}}{h_{2}}b_{1}\right)+\frac{b_{1}}{mh_{2}}\rho_{1}=-\frac{2D_{i}}{tr[V^{-2}](A+D_{i})^{3}}.$$

$$\rho_{1}=\frac{\partial\log\pi(A)}{\partial A}=\frac{2}{A+D_{i}}-\frac{2tr[V^{-3}]}{tr[V^{-2}]}.$$

$$\pi(A)\propto(A+D_{i})^{2}tr[V^{-2}].$$

$$\pi_{i}(A)\propto(A+D_{i})^{2}tr[V^{-2}].$$

$$\pi(A)\propto\frac{\sum\{1/(A+D_{i})^{2}\}}{\sum\omega_{i}\{D_{i}^{2}/(A+D_{i})^{2}\}}$$

$$y_{ij}=\theta_{ij}+e_{ij}=x_{ij}^{\prime}\beta+v_{i}+e_{ij},\quad(i=1,\ldots,m;\ j=1,\ldots,n_{i}),$$

$$\left[\frac{\partial\log h_{i;G}(\psi)}{\partial\psi}\right]^{\prime}I_{F}^{-1}\left[\frac{\partial B_{i}(\psi)}{\partial\psi}\right]=H(\psi),$$

$$\frac{\partial\log h_{i;G}(\psi)}{\partial\psi}=\left(\frac{\partial\log h_{i;G}(\psi)}{\partial\sigma_{v}^{2}},\frac{\partial\log h_{i;G}(\psi)}{\partial\sigma_{e}^{2}}\right)^{\prime},$$

$$H(\psi)=-\frac{1}{2}tr\left[\frac{\partial^{2}B_{i}(\psi)}{\partial\psi^{2}}I_{F}^{-1}\right],\quad\frac{\partial B_{i}(\psi)}{\partial\psi}=\frac{n_{i}}{(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{2}}(-\sigma_{e}^{2},\sigma_{v}^{2})^{\prime},$$

$$I_{F}^{-1}=\frac{2}{a}\left(\begin{array}{cc}\sum[(n_{i}-1)/\sigma_{e}^{4}+(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{-2}]&-\sum n_{i}/(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{2}\\ -\sum n_{i}/(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{2}&\sum n_{i}^{2}/(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{2}\end{array}\right),$$

$$a=\left[\sum n_{i}^{2}/(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{2}\right]\left[\sum\{(n_{i}-1)/\sigma_{e}^{4}+(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{-2}\}\right]-\left[\sum n_{i}/(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})^{2}\right]^{2}.$$

$$\frac{\partial\log h_{i;G}(\psi)}{\partial\psi}=vk,$$

$$v=\frac{H(\psi)}{k^{\prime}I_{F}^{-1}\frac{\partial B_{i}(\psi)}{\partial\psi}}.$$

$$\frac{\partial\log h_{i;G}(\psi)}{\partial\psi}=\frac{H(\psi)}{k^{\prime}I_{F}^{-1}\frac{\partial B_{i}(\psi)}{\partial\psi}}k.$$


Taxonomy

Topics: Statistical Methods and Inference · Statistical Methods and Bayesian Inference · Advanced Statistical Methods and Models

Full text

Multi-Goal Prior Selection: A Way to Reconcile Bayesian and Classical Approaches for Random Effects Models

Masayo Y. Hirose

The Institute of Statistical Mathematics, Japan

and

Partha Lahiri

Joint Program in Survey Methodology & Department of Mathematics,

University of Maryland, College Park, U.S.A.

Abstract

The two-level normal hierarchical model has played an important role in statistical theory and applications. In this paper, we first introduce a general adjusted maximum likelihood method for estimating the unknown variance component of the model and the associated empirical best linear unbiased predictor of the random effects. We then discuss a new idea for selecting a prior for the hyperparameters. The prior, called a multi-goal prior, produces Bayesian solutions for the hyperparameters and random effects that match (in the higher-order asymptotic sense) the corresponding classical solutions in the linear mixed model with respect to several properties. Moreover, we establish for the first time an analytical equivalence of the posterior variances under the proposed multi-goal prior and the corresponding parametric bootstrap second-order mean squared error estimates in the context of a random effects model.

Keywords Adjusted maximum likelihood method, empirical Bayes, empirical best linear unbiased prediction, linear mixed model.

1 Introduction

Simultaneous estimation of several independent normal means has been a topic of great research interest, especially in the 1960s, 1970s and 1980s, after the publication of the celebrated James-Stein estimator (James and Stein, 1961). Let $y=(y_{1},\ldots,y_{m})^{\prime}$ be the maximum likelihood estimator of $\theta=(\theta_{1},\cdots,\theta_{m})^{\prime}$ under the model $y_{i}|\theta_{i}\stackrel{ind.}{\sim}N(\theta_{i},1),\;i=1,\cdots,m.$ James and Stein (1961) provided the surprising result that, for $m\geq 3$, $y$ is an inadmissible estimator of $\theta$ under this model and the sum of squared error loss function $L(\hat{\theta},\theta)=\sum_{i=1}^{m}(\hat{\theta}_{i}-\theta_{i})^{2}$. They also showed that the estimator $\hat{\theta}_{i}^{JS}=(1-\hat{B}^{JS})y_{i}$, where $\hat{B}^{JS}={(m-2)}/{(\sum_{i=1}^{m}y_{i}^{2})}$, dominates $y$ in terms of frequentist risk. To be specific, $E[\sum_{i=1}^{m}(\hat{\theta}_{i}^{JS}-\theta_{i})^{2}|\theta]\leq E[\sum_{i=1}^{m}(y_{i}-\theta_{i})^{2}|\theta]$ for all $\theta\in\mathcal{R}^{m},$ the $m$-dimensional Euclidean space, with strict inequality holding for at least one point $\theta$.
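As an illustration only, the James-Stein formula quoted above can be sketched in a few lines of Python. The function name `james_stein` and the handling of an all-zero input are assumptions made for this sketch and are not part of the paper.

```python
def james_stein(y):
    """Shrink each y_i toward 0 by the factor 1 - (m - 2) / sum_i y_i^2."""
    m = len(y)
    if m < 3:
        raise ValueError("the James-Stein estimator is defined here for m >= 3")
    total = sum(v * v for v in y)
    if total == 0.0:
        # Assumption of this sketch: with all observations at zero the
        # shrinkage factor is undefined, so the data are returned unchanged.
        return list(y)
    b_hat = (m - 2) / total  # estimated shrinkage factor B^JS
    return [(1.0 - b_hat) * v for v in y]


if __name__ == "__main__":
    print(james_stein([3.0, 4.0, 0.0, 0.0, 0.0]))
```

With the input shown, $m=5$ and $\sum_i y_i^2=25$, so every coordinate is multiplied by $1-3/25=0.88$.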

The potential of different extensions of the James-Stein estimator to improve data analysis became transparent when Efron and Morris (1973) provided an empirical Bayesian justification of the James-Stein estimator using the prior $\theta_{i}\stackrel{iid}{\sim}N(0,A)$, $i=1,\cdots,m$. Some earlier applications of the empirical Bayesian method include the estimation of: (i) false alarm probabilities in New York City (Carter and Rolph, 1974), (ii) the batting averages of major league baseball players (Efron and Morris, 1975), (iii) the prevalence of toxoplasmosis in El Salvador (Efron and Morris, 1975) and (iv) the per-capita income of small places in the USA (Fay and Herriot, 1979). More recently, variants of the method given in Efron and Morris (1973) have been used to estimate poverty rates for the US states, counties, and school districts (Citro and Kalton, 2000) and Chilean municipalities (Casas-Cordero, Encina and Lahiri, 2016), and to estimate proportions at the lowest level of literacy for states and counties (Mohadjer et al., 2012).

The following two-level Normal hierarchical model is an extension of the model used by Efron and Morris (1973):

For $i=1,\ldots,m$ ,

Level 1 (sampling model): $y_{i}|\theta_{i}\stackrel{\mathrm{ind.}}{\sim}N(\theta_{i},D_{i})$;

Level 2 (linking model): $\theta_{i}\stackrel{\mathrm{ind.}}{\sim}N(x_{i}^{\prime}\beta,A)$.

In the above model, level 1 is used to account for the sampling distribution of unbiased estimates $y_{i}$ based on observations taken from the $i$th population. In this model, we assume that the sampling variances $D_{i}$ are known; this assumption often follows from the asymptotic variances of transformed direct estimates (Efron and Morris, 1975; Carter and Rolph, 1974) or from empirical variance modeling (Fay and Herriot, 1979; Otto and Bell, 1995). Level 2 links the random effects $\theta_{i}$ to a vector of $p$ known auxiliary variables $x_{i}=(x_{i1},\cdots,x_{ip})^{\prime}$, which are often obtained from various alternative data sources. The parameters $\beta$ and $A$ are generally unknown and are estimated from the available data. We assume that $\beta\in\mathcal{R}^{p},$ the $p$-dimensional Euclidean space. In the growing field of small area estimation, this model is commonly referred to as the Fay-Herriot model, named after the authors of the landmark paper by Fay and Herriot (1979), which has more than 1200 citations to date (according to Google Scholar). For a comprehensive review of small area estimation, the readers are referred to the books by Jiang (2007) and Rao and Molina (2015).

We may be interested in the high-dimensional parameters (random effects) $\theta_{i}$ and/or the hyperparameters $\beta$ and $A$. The estimation problem can be addressed using either a Bayesian or a classical linear mixed model approach. When the hyperparameters are known, both approaches use the conditional distribution of $\theta_{i}$ given the data for point estimation and for measuring the uncertainty of the point estimator. To elaborate, the posterior mean of $\theta_{i}$, the Bayesian point estimator, is identical to the best predictor of $\theta_{i}$. Moreover, the posterior variance of $\theta_{i}$ is identical to the mean squared error of the best predictor. When $A$ is known but $\beta$ is unknown, a flat prior is generally assumed for $\beta$ under the Bayesian approach. Interestingly, in this unknown-$\beta$ case, the posterior mean and posterior variance of $\beta$ are identical to the maximum likelihood estimator of $\beta$ and the variance of the maximum likelihood estimator, respectively. Moreover, the posterior mean and variance of $\theta_{i}$ are identical to the best linear unbiased predictor of $\theta_{i}$ and its mean squared error, respectively.
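For reference, the known-hyperparameter case described above is the usual normal-normal conjugate update. Written out for the two-level model stated earlier, with the shrinkage factor $B_{i}=D_{i}/(A+D_{i})$ that the paper introduces later, it reads:

```latex
\theta_i \mid y_i \;\sim\; N\!\bigl((1-B_i)\,y_i + B_i\,x_i'\beta,\;\; (1-B_i)\,D_i\bigr),
\qquad B_i = \frac{D_i}{A+D_i}.
```

The posterior mean is the best predictor of $\theta_{i}$, and the posterior variance $(1-B_{i})D_{i}=AD_{i}/(A+D_{i})$ is its mean squared error.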

When both $\beta$ and $A$ are unknown, the flat prior, i.e., $\pi(\beta,A)\propto 1,\;\beta\in\mathcal{R}^{p},A>0$, is common, though a few other priors for $A$ have been considered; see, e.g., Datta et al. (2005) and Morris and Tang (2011). In the classical linear mixed model approach, different estimators of $A$ have been proposed, and the estimator of $\beta$ is obtained by plugging an estimator of $A$ into the maximum likelihood estimator of $\beta$ for known $A$. In this general case, the relationship between the Bayesian and the classical linear mixed model approaches is not clear. The main goal of this paper is to understand the nature of this relationship. In particular, we answer the following question: for a given classical method of estimating $A$, is it possible to find a prior on $A$ that will make the Bayesian solution closer to the classical solution in achieving the multiple goals (i)-(v) described in Section 3, or the subset of these goals given in Theorem 2?

What would be the parameters of interest in setting the multiple goals? To this end, we first note that Morris and Tang (2011) pointed out the need for accurately estimating the shrinkage parameters $B_{i}=D_{i}/(A+D_{i})$, as they appear linearly in the Bayes estimators of $\theta_{i}$, which are the prime parameters of interest in many applications such as small area estimation. Moreover, the shrinkage parameters are good indicators of the strength of the prior on the random effects $\theta_{i}$. Despite the importance of the shrinkage parameters, relatively little research has been conducted to understand the theoretical properties of existing estimators. For the balanced case when $D_{i}=D,\;i=1,\cdots,m$, Morris (1983) proposed an exact unbiased estimator of $B=D/(A+D)$ and showed component-wise dominance of the resulting empirical Bayes estimator of $\theta_{i}$ under the joint distribution of $\{(y_{i},\theta_{i}),\;i=1,\cdots,m\}$ when $p\leq m-3.$ For the general unbalanced case, Hirose and Lahiri (2018) proposed an adjusted maximum likelihood estimator of $B_{i}$ that satisfies multiple desirable properties. First, the method yields an estimator of $B_{i}$ that is strictly less than 1, which prevents the overshrinking problem in the related empirical best linear unbiased predictor, or simply empirical best predictor, of $\theta_{i}$. Second, this adjusted maximum likelihood estimator of $B_{i}$ has the smallest bias among all existing rival estimators in the higher-order asymptotic sense. Third, when this adjusted maximum likelihood method is used, a second-order unbiased estimator of the mean squared error of the empirical best linear unbiased predictor can be produced in a straightforward way, without the additional bias corrections that are necessary for other existing variance component estimation methods. For prior work on the adjusted maximum likelihood method, the readers are referred to Lahiri and Li (2009), Li and Lahiri (2010), Yoshimori and Lahiri (2014a,b), Hirose and Lahiri (2018), and Hirose (2017, 2019).

As stated in Morris and Tang (2011), the flat prior leads to admissible minimax estimators of the random effects for a special case of the model. In Section 3, we show that the bias of the Bayes estimator of $B_{i}$, under the flat prior and the two-level model, is $O(m^{-1})$ except for the balanced case, when it is of lower order $o(m^{-1})$. Thus, in general, the Bayes estimator of $B_{i}$ under the flat prior has more bias than the adjusted maximum likelihood estimator of Hirose and Lahiri (2018) in the higher-order asymptotic sense. In the same section, we propose a prior for the hyperparameters that leads to a Bayes estimator of $B_{i}$ with bias of lower order $o(m^{-1})$, which is thus on par with the adjusted maximum likelihood estimator of Hirose and Lahiri (2018). Interestingly, this prior also makes the resulting Bayesian method much closer to the Hirose-Lahiri empirical best linear unbiased prediction method in multiple senses. In particular, the posterior variance of the random effect $\theta_{i}$, under the proposed prior, is identical to both the Taylor series and parametric bootstrap second-order mean squared error estimators of Hirose and Lahiri (2018) in the higher-order asymptotic sense. To our knowledge, we establish for the first time the relationship between the Bayesian posterior variance and the parametric bootstrap mean squared error estimator in this higher-order asymptotic sense.

The outline of the paper is as follows. In Section 2, we first introduce a classical method for the two-level model by proposing a general adjustment factor in estimating $A$. We show how the method is related to the commonly used residual maximum likelihood method for a given choice of the adjustment factor. We then construct a prior, called a multi-goal prior, that provides a Bayesian solution close (with respect to several properties in the higher-order asymptotic sense) to the classical solution for estimating the hyperparameters and random effects. Section 3 discusses the prior choice for an important special case considered by Hirose and Lahiri (2018). In addition to the multiple properties discussed in Section 2, this section develops a unique multi-goal prior that establishes a relationship between the posterior variances of the random effects and the Hirose-Lahiri Taylor series and parametric bootstrap mean squared error estimators, which do not require the usual complex bias corrections. We reiterate that this paper demonstrates for the first time how to bring the Bayesian and classical parametric bootstrap methods closer in the context of random effects models. In Section 4, we compare the proposed multi-goal prior with the superharmonic prior using a real-life data set. In Section 5, we discuss issues in extending our results to a general model. All the technical proofs are deferred to the Appendix.

2 Prior Choice for reconciliation of the Bayesian and classical approach

In this section, we first introduce a general classical method for estimating the hyperparameters and random effects in the two-level Normal hierarchical model. We then construct a prior for the hyperparameters so that the corresponding Bayesian method is identical to the classical method in the higher-order asymptotic sense with respect to multiple properties.

We first introduce the empirical best linear unbiased predictor of $\theta_{i}$ when the variance component $A$ is estimated by a general adjusted maximum likelihood method. To this end, we define the mean squared error of a given predictor $\hat{\theta}_{i}$ of $\theta_{i}$ as $M_{i}(\hat{\theta}_{i})=E(\hat{\theta}_{i}-\theta_{i})^{2}$, where the expectation is with respect to the joint distribution of $y=(y_{1},\cdots,y_{m})^{\prime}$ and $\theta=(\theta_{1},\cdots,\theta_{m})^{\prime}$ under the two-level normal model. The best linear unbiased predictor $\hat{\theta}_{i}^{BLUP}$ of $\theta_{i}$, which minimizes $M_{i}(\hat{\theta}_{i})$ among all linear unbiased predictors $\hat{\theta}_{i}$, is given by $\hat{\theta}_{i}^{BLUP}(A)=(1-B_{i})y_{i}+B_{i}x^{\prime}_{i}\hat{\beta}(A),$ where $B_{i}\equiv B_{i}(A)=D_{i}/(A+D_{i})$ is the shrinkage factor and $\hat{\beta}(A)=(X^{\prime}{V}^{-1}X)^{-1}X^{\prime}{V}^{-1}y$ is the weighted least squares estimator of $\beta$ when $A$ is known. In this formula, $X^{\prime}=(x_{1},\cdots,x_{m})$ denotes the $p\times m$ matrix of known auxiliary variables and $V=\mbox{diag}(A+D_{1},\cdots,A+D_{m})$ denotes the $m\times m$ diagonal covariance matrix of $y$.
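The predictor above can be sketched numerically as follows. To keep the example free of matrix algebra, it assumes a single scalar covariate ($p=1$), so that $\hat{\beta}(A)$ reduces to a ratio of weighted sums; the function name `blup` and this simplification are choices made for this sketch, not the paper's.

```python
def blup(y, x, D, A):
    """Best linear unbiased predictor for known A with one scalar covariate.

    Returns (beta_hat, theta_hat), where theta_hat[i] is
    (1 - B_i) * y_i + B_i * x_i * beta_hat and B_i = D_i / (A + D_i).
    """
    w = [1.0 / (A + d) for d in D]  # diagonal entries of V^{-1}
    # Weighted least squares with p = 1: (X'V^{-1}X)^{-1} X'V^{-1}y
    beta_hat = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
                / sum(wi * xi * xi for wi, xi in zip(w, x)))
    theta_hat = []
    for yi, xi, di in zip(y, x, D):
        b_i = di / (A + di)  # shrinkage factor B_i(A)
        theta_hat.append((1.0 - b_i) * yi + b_i * xi * beta_hat)
    return beta_hat, theta_hat


if __name__ == "__main__":
    print(blup(y=[1.0, 3.0], x=[1.0, 1.0], D=[1.0, 1.0], A=1.0))
```

In the small example, $B_{i}=1/2$ for both areas and $\hat{\beta}=2$, so each $y_{i}$ is pulled halfway toward 2.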

We consider the following general adjusted maximum likelihood estimator $\hat{A}_{i;G}$ of $A$ :

$$\hat{A}_{i;G}=\arg\max_{0\leq A<\infty}h_{i;G}(A)L_{RE}(A),$$

where the general adjustment factor $h_{i;G}(A)$ satisfies Condition R5 in Appendix A. Note that maximum likelihood, residual maximum likelihood and different adjusted maximum likelihood estimators of $A$ can be produced using suitable choices of $h_{i;G}(A)$ . Plugging in $\hat{A}_{i;G}$ for $A$ in the best linear unbiased predictor, one obtains an empirical best linear unbiased predictor $\hat{\theta}_{i}^{EB}(\hat{A}_{i;G})$ of $\theta_{i}$ .
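A minimal numerical sketch of this estimator follows, under several assumptions that are not from the paper: a single scalar covariate, a user-supplied `log_h` for the logarithm of the adjustment factor, a finite search interval `[0, upper]`, and an objective that is unimodal on that interval (which golden-section search requires). The names `residual_loglik` and `adjusted_ml` are hypothetical.

```python
import math


def residual_loglik(A, y, x, D):
    """Residual log-likelihood of A (up to a constant), scalar covariate."""
    w = [1.0 / (A + d) for d in D]  # diagonal of V^{-1}
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    syy = sum(wi * yi * yi for wi, yi in zip(w, y))
    # -0.5 log|V| - 0.5 log|X'V^{-1}X| - 0.5 y'Py
    return (0.5 * sum(math.log(wi) for wi in w)
            - 0.5 * math.log(sxx)
            - 0.5 * (syy - sxy * sxy / sxx))


def adjusted_ml(y, x, D, log_h, upper=1e3, tol=1e-8):
    """Maximise log h(A) + l_RE(A) over [0, upper] by golden-section search."""
    def objective(A):
        return log_h(A) + residual_loglik(A, y, x, D)

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, upper
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if objective(c) < objective(d):
            lo = c  # the maximiser lies to the right of c
        else:
            hi = d  # the maximiser lies to the left of d
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    y, x, D = [0.0, 2.0, 4.0, 6.0, 8.0], [1.0] * 5, [1.0] * 5
    print(adjusted_ml(y, x, D, log_h=lambda A: 0.0, upper=100.0))
    print(adjusted_ml(y, x, D, log_h=lambda A: math.log(A + 1.0), upper=100.0))
```

With `log_h` identically zero the sketch returns the residual maximum likelihood estimate. In the balanced intercept-only example, the maximiser has the closed form $\sum_{i}(y_{i}-\bar{y})^{2}/(m-1)-D$, and multiplying the likelihood by $h(A)=A+D$ moves it to $\sum_{i}(y_{i}-\bar{y})^{2}/(m-3)-D$.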

Since the residual maximum likelihood estimator of $A$ has the lowest bias among existing estimators in the higher-order asymptotic sense, it is of interest to establish a relationship between the general adjusted maximum likelihood estimator and the residual maximum likelihood estimator. We describe this relationship in Theorem 1; see Appendix A.1 for a proof.

Theorem 1.

Under regularity conditions R1-R5,

$$\hat{A}_{i;G}-\hat{A}_{RE}=\frac{2\tilde{l}_{i;G}^{(1)}(A)}{tr[V^{-2}]}+o_{p}(m^{-1}),$$

where $\tilde{l}_{i;G}^{(1)}(A)=\frac{\partial\log h_{i;G}(A)}{\partial A}$ .
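As a worked illustration, read Theorem 1 as $\hat{A}_{i;G}-\hat{A}_{RE}=2\tilde{l}_{i;G}^{(1)}(A)/tr[V^{-2}]+o_{p}(m^{-1})$ and take the power-type factor $h_{i;G}(A)=(A+D_{i})^{s}$ that appears in Remark 1, assuming this choice satisfies the regularity conditions:

```latex
\tilde{l}_{i;G}^{(1)}(A) = \frac{\partial}{\partial A}\, s\log(A+D_i) = \frac{s}{A+D_i},
\qquad
\hat{A}_{i;G}-\hat{A}_{RE}
  = \frac{2s}{(A+D_i)\,\sum_{j=1}^{m}(A+D_j)^{-2}} + o_p(m^{-1}).
```

Since $tr[V^{-2}]=\sum_{j}(A+D_{j})^{-2}$ grows proportionally to $m$ when the $D_{j}$ are bounded away from zero and infinity, the leading term is of order $m^{-1}$ and is positive for $s>0$.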

We now present Theorem 2 for constructing a prior, starting from a given adjustment factor $h_{i;G}(A)$, in order to bring the resulting Bayesian method closer to the classical method with respect to three criteria. To this end, let $p(\beta,A)$ denote the prior for $(\beta,A)$. Following Datta et al. (2005), we assume $p(\beta,A)\propto\pi(A)$ and introduce the following notation, to be used throughout the paper:

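$$\hat{b}_{1}=\frac{\partial B_{i}}{\partial A}\Big|_{\hat{A}_{RE}},\ \hat{b}_{2}=\frac{\partial^{2}B_{i}}{\partial A^{2}}\Big|_{\hat{A}_{RE}},\ \hat{\rho}_{1}=\frac{\partial\log\pi(A)}{\partial A}\Big|_{\hat{A}_{RE}},$$

$$\hat{h}_{2}=-\frac{1}{m}\frac{\partial^{2}l_{RE}}{\partial A^{2}}\Big|_{\hat{A}_{RE}}=\frac{tr[V^{-2}]}{2m}+o_{p}(m^{-1}),$$

$$\hat{h}_{3}=-\frac{1}{m}\frac{\partial^{3}l_{RE}}{\partial A^{3}}\Big|_{\hat{A}_{RE}}=-\frac{2tr[V^{-3}]}{m}+o_{p}(m^{-1}),$$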

where $\hat{A}_{RE}$ is the residual maximum likelihood estimator of $A$ , and $l_{RE}$ is the logarithm of residual likelihood.

Theorem 2.

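Under Regularity Conditions R1-R5, if $p(\beta,A)\propto\pi_{i;G}(A)$ and

$$\pi_{i;G}(A)\propto(A+D_{i})\,tr(V^{-2})\,h_{i;G}(A),\qquad(2)$$

we have:

(i) $\hat{B}_{i}^{GHB}=\hat{B}_{i}(\hat{A}_{i;G})+o_{p}(m^{-1})$;

(ii) $\hat{V}_{i}^{GHB}=V[B_{i}|y]=Var(\hat{B}_{i}(\hat{A}_{i;G}))+o_{p}(m^{-1})$;

(iii) $\hat{\theta}_{i}^{GHB}=\hat{\theta}_{i}(\hat{A}_{i;G})+o_{p}(m^{-1})$.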

The proof of Theorem 2 is deferred to Appendix A.2.

Remark 1.

We have several remarks on the general multi-goal prior given by (2).

(a) Theorem 2 is valid for multiple choices of $h_{i;G}$.

(b) There exists at least one strictly positive estimate of $A$ if $h_{i;G}(A)>0$ and

$$h_{i;G}(A)=o(A^{(m-p)/2})$$

for large $A$ under R6-R7.

(c) Note that $h_{i;G}(A)$ may not qualify as a bona fide prior since it may result in an improper posterior; see Yoshimori and Lahiri (2014b) for an example. However, if we restrict the class of priors to $h_{i;G}(A)=(A+D_{i})^{s}$ for some $s>0$, we show in Appendix B.1 that $h_{i;G}(A)=o(A^{(m-p-2)/2})$ is a sufficient condition for the propriety of the posterior, so that $h_{i;G}(A)$ can serve as a prior for $A$.

On the other hand, it is straightforward to show that $\pi_{i;G}(A)$ given by (2) with $h_{i;G}(A)=o(A^{(m-p)/2})$ yields a proper posterior because of the multiplication of $h_{i;G}(A)$ by $(A+D_{i})tr(V^{-2})$. In either case, Theorem 2 can help users select an adjustment factor in the empirical best linear unbiased prediction approach or a prior in the Bayesian approach.

3 Multi-Goal Prior for an important special case

Hirose and Lahiri (2018) put forward a classical approach for an important choice of $h_{i;G}(A)$ that satisfies the following desirable properties under regularity conditions R1-R7:

1. It is desirable to have a second-order unbiased estimator of $B_{i}$, i.e., $E(\hat{B}_{i})=B_{i}+o(m^{-1})$.

2. $0<\mbox{inf}_{m\geq 1}\hat{B}_{i}\leq\mbox{sup}_{m\geq 1}\hat{B}_{i}<1$ (a.s.) for protecting the empirical best linear unbiased predictor from over-shrinking to the regression estimator.

3. It is desirable to obtain a simple second-order unbiased Taylor series mean squared error estimator of the empirical best linear unbiased predictor without any bias correction; that is, $E[\hat{M}_{i}(\hat{A}_{i})]=M_{i}(\hat{\theta}_{i}^{EB})+o(m^{-1}).$

4. It is desirable to produce a strictly positive second-order unbiased single parametric bootstrap mean squared error estimator without any bias correction,

where $\hat{M}_{i}(\hat{A}_{i})$ denotes an estimator of the mean squared error of $\hat{\theta}_{i}^{EB}(\hat{A})$.

Let $\hat{A}_{i;MG}$, $\hat{B}_{i;MG}$, $\hat{\theta}_{i;MG}^{EB}$, $\hat{M}_{i;MG}$, $\hat{M}_{i;MG}^{boot}$ be the Hirose-Lahiri estimators of $A$ and $B_{i}$, the empirical best linear unbiased predictor of $\theta_{i}$, and the Taylor series and parametric bootstrap estimators of the mean squared error of the empirical best linear unbiased predictor, respectively. They are given by

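$$\hat{A}_{i;MG}=\arg\max_{0<A<\infty}\ \ldots$$

$$\hat{B}_{i;MG}=\hat{B}_{i}(\hat{A}_{i;MG}),$$

$$\hat{M}_{i;MG}=\hat{M}_{i}(\hat{A}_{i;MG}),$$

…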

where $\tilde{h}_{i}(A)=h_{+}(A)(A+D_{i})$ with $m>p+2$; $h_{+}(A)$ satisfies Conditions R6-R7 in Appendix A; $\theta_{i}^{*}=x_{i}^{\prime}\hat{\beta}(\hat{A}_{1;MG},\ldots,\hat{A}_{m;MG})+u_{i}^{*}$ with $u_{i}^{*}\stackrel{ind.}{\sim}N(0,\hat{A}_{i;MG})$; and $E_{*}$ is the expectation with respect to the two-level Normal hierarchical model with $\beta$ and $A$ replaced by $\hat{\beta}(\hat{A}_{1;MG},\ldots,\hat{A}_{m;MG})$ and $\hat{A}_{i;MG}$, respectively. Note that the choice of $h_{+}(A)$ is not unique in general. One can use the choice given in Yoshimori and Lahiri (2014a).

The following corollary follows from Theorem 1, Hirose and Lahiri (2018) and the fact that $\frac{\partial\hat{\beta}(A)}{\partial A}=O_{p}(m^{-1/2})$ .

Corollary 1.

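Under the regularity conditions,

(i) $\hat{A}_{i;MG}-\hat{A}_{RE}=O_{p}(m^{-1})$;

(ii) $x_{i}^{\prime}\hat{\beta}(\hat{A}_{1;MG},\ldots,\hat{A}_{m;MG})-x_{i}^{\prime}\hat{\beta}(\hat{A}_{RE})=o_{p}(m^{-1})$.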

In this section, we suggest a Bayesian approach that is close to the classical approach to achieve multiple goals in the higher-order asymptotic sense. To this end, we seek a multi-goal prior on the hyperparameters $(\beta,A)$ that satisfies all the following properties simultaneously:

(i) $\hat{B}_{i}^{HB}\equiv E[B_{i}|y]=\hat{B}_{i;MG}+o_{p}(m^{-1})$;

(ii) $V[B_{i}|y]=Var(\hat{B}_{i;MG})+o_{p}(m^{-1})$;

(iii) $\hat{\theta}_{i}^{HB}\equiv E[\theta_{i}|y]=\hat{\theta}_{i;MG}^{EB}+o_{p}(m^{-1})$;

(iv) $V[\theta_{i}|y]=\hat{M}_{i;MG}+o_{p}(m^{-1})$;

(v) $V[\theta_{i}|y]=\hat{M}_{i;MG}^{boot}+o_{p}(m^{-1})$.

First we prepare the following result, which follows from Corollary 1 (i) and Hirose and Lahiri (2018):

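$$\hat{B}_{i}(\hat{A}_{i;MG})-\hat{B}_{i}(\hat{A}_{RE})=\{E[\hat{A}_{i;MG}-A]-E[\hat{A}_{RE}-A]\}b_{1}+o_{p}(m^{-1})=-\frac{2D_{i}}{tr[V^{-2}](A+D_{i})^{3}}+o_{p}(m^{-1}).\qquad(4)$$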

If we use the flat prior $\pi(A)\propto 1$ , we get the following result using equation (21) of Datta et al. (2005) with $b(A)=B_{i}(A)$ and equation (4):

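$$E[B_{i}|y]=\hat{B}_{i}(\hat{A}_{i;MG})+\frac{4D_{i}}{tr[V^{-2}](A+D_{i})^{2}}\left[\frac{1}{A+D_{i}}-\frac{tr[V^{-3}]}{tr[V^{-2}]}\right]+o_{p}(m^{-1}).$$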

This result emphasizes that the flat prior $\pi(A)\propto 1$ cannot achieve Property (i) except in the balanced case ($D_{i}=D$ for all $i$). We therefore seek a prior $\pi(A)$ that satisfies Property (i) even in the unbalanced case. To this end, we also use the following result (5), given in (21) of Datta et al. (2005) with $b(A)=B_{i}(A)$:
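The role of balance can be checked numerically. In the flat-prior expansion, the $O(m^{-1})$ gap between $E[B_{i}|y]$ and $\hat{B}_{i}(\hat{A}_{i;MG})$ is proportional to $1/(A+D_{i})-tr[V^{-3}]/tr[V^{-2}]$. The sketch below evaluates that factor; the function name `flat_prior_gap` and the example values are hypothetical.

```python
def flat_prior_gap(A, D, i):
    """Return 1/(A + D_i) - tr[V^{-3}] / tr[V^{-2}], V = diag(A + D_j)."""
    tr2 = sum((A + d) ** -2 for d in D)
    tr3 = sum((A + d) ** -3 for d in D)
    return 1.0 / (A + D[i]) - tr3 / tr2


if __name__ == "__main__":
    print(flat_prior_gap(2.0, [1.0, 1.0, 1.0, 1.0], 0))  # balanced D_i
    print(flat_prior_gap(1.0, [1.0, 3.0], 0))            # unbalanced D_i
```

When all $D_{j}$ equal $D$, the ratio $tr[V^{-3}]/tr[V^{-2}]$ is exactly $1/(A+D)$ and the factor vanishes; with unequal sampling variances it generally does not.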

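$$E[B_{i}|y]=\hat{B}_{i}(\hat{A}_{RE})+\frac{1}{2m\hat{h}_{2}}\left(\hat{b}_{2}-\frac{\hat{h}_{3}}{\hat{h}_{2}}\hat{b}_{1}\right)+\frac{\hat{b}_{1}}{m\hat{h}_{2}}\hat{\rho}_{1}+o_{p}(m^{-1}).\qquad(5)$$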

It is evident from equations (4) and (5) that our desired prior must satisfy the following differential equation, up to the order of $O(m^{-1})$ :

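$$\frac{1}{2mh_{2}}\left(b_{2}-\frac{h_{3}}{h_{2}}b_{1}\right)+\frac{b_{1}}{mh_{2}}\rho_{1}=-\frac{2D_{i}}{tr[V^{-2}](A+D_{i})^{3}}.\qquad(6)$$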

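Note that the differential equation (6) is equivalent to the following differential equation, up to the order of $O_{p}(m^{-1})$:

$$\rho_{1}=\frac{\partial\log\pi(A)}{\partial A}=\frac{2}{A+D_{i}}-\frac{2tr[V^{-3}]}{tr[V^{-2}]}.\qquad(7)$$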

Hence, we obtain a solution to differential equation (7) as follows:

$$\pi(A)\propto(A+D_{i})^{2}tr[V^{-2}].\qquad(8)$$
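As a quick check, the solution $\pi(A)\propto(A+D_{i})^{2}tr[V^{-2}]$ can be differentiated directly, using only $V=\mbox{diag}(A+D_{1},\cdots,A+D_{m})$:

```latex
\log\pi(A) = 2\log(A+D_i) + \log tr[V^{-2}] + \text{const},
\qquad
\frac{\partial\, tr[V^{-2}]}{\partial A}
  = \frac{\partial}{\partial A}\sum_{j=1}^{m}(A+D_j)^{-2}
  = -2\,tr[V^{-3}],
\]
\[
\frac{\partial\log\pi(A)}{\partial A}
  = \frac{2}{A+D_i} - \frac{2\,tr[V^{-3}]}{tr[V^{-2}]},
```

which is the right-hand side required of $\rho_{1}$ in the differential equation (7).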

Note that the prior (8) depends on $i$ . Therefore, we redefine it as:

$$\pi_{i}(A)\propto(A+D_{i})^{2}tr[V^{-2}].\qquad(9)$$

Remark 2.

We have several important remarks on the prior (9).

(a) The prior satisfies the remaining Properties (ii)-(v) simultaneously, as shown in Appendix B.2. It is remarkable that $\pi_{i}(A)$ given by (9) is the unique prior that achieves Properties (i)-(v) simultaneously, up to the order of $O_{p}(m^{-1})$, since $E[g_{1i}(A)|y]=g_{1i}(\hat{A}_{i;MG})+o_{p}(m^{-1})$, as shown in (27).

(b) The prior given by equation (9) reduces to Stein's superharmonic prior for the balanced case $D_{i}=D,\;i=1,\cdots,m$, up to the order of $O_{p}(m^{-1})$.

(c) Datta et al. (2005) found the same prior by matching (in a higher-order asymptotic sense) the expected value of the posterior variance of $\theta_{i}$ with the mean squared error of the empirical best linear unbiased predictor when the residual maximum likelihood estimator is used for the variance component $A$. It is interesting to note that the same prior achieves multiple goals, a fact that has gone unnoticed.

(d) From the result of Ganesh and Lahiri (2008), the prior

$$\pi(A)\propto\frac{\sum\{1/(A+D_{i})^{2}\}}{\sum\omega_{i}\{D_{i}^{2}/(A+D_{i})^{2}\}}$$

also satisfies $\sum_{i=1}^{m}\omega_{i}E[{V}(\theta_{i}|y)-MSE[\hat{\theta}_{i}(\hat{A}_{i;MG})]]=o(m^{-1}).$
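The prior (9), $\pi_{i}(A)\propto(A+D_{i})^{2}tr[V^{-2}]$, is simple to evaluate. The sketch below computes its unnormalised value; the function name `multi_goal_prior` and the example inputs are hypothetical.

```python
def multi_goal_prior(A, D, i):
    """Unnormalised prior (A + D_i)^2 * tr[V^{-2}], with V = diag(A + D_j)."""
    tr2 = sum((A + d) ** -2 for d in D)
    return (A + D[i]) ** 2 * tr2


if __name__ == "__main__":
    balanced = [2.0] * 4
    print([multi_goal_prior(A, balanced, 0) for A in (0.5, 7.0)])
    unbalanced = [1.0, 3.0]
    print([multi_goal_prior(A, unbalanced, 0) for A in (0.5, 7.0)])
```

In the balanced case the product equals $m$ for every $A$, so the unnormalised prior is constant in $A$, which is consistent with the reduction noted in Remark 2(b). With unequal $D_{j}$ it varies with both $A$ and the area index $i$.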

4 Data Analysis

In this section, using the 1993 Small Area Income and Poverty Estimates (SAIPE) data set, we demonstrate that our proposed multi-goal prior (MGP) performs better than the superharmonic prior (SHP) in producing Bayesian solutions closer to the multi-goal classical solutions of Hirose and Lahiri (2018). The SAIPE data we use here are from Bell and Franco (2017), available at https://www.census.gov/srd/csrmreports/byyear.html. The data contain direct poverty rates ($y_{i}$), associated sampling variances ($D_{i}$), and auxiliary variables ($x_{i}$) derived from administrative and census data for the 50 states and the District of Columbia. Much has been written about SAIPE over the years; see, for instance, the recent book chapter by Bell et al. (2016).

First consider the estimation of the shrinkage parameters $B_{i}$ for all the states. Fig 1 displays the classical multi-goal estimates $\hat{B}_{i;MG}$ and the Bayes estimates of $B_{i}$ under the superharmonic and the multi-goal priors for all the states, arranged in decreasing order of $\hat{B}_{i;MG}$. Note that the Bayes estimate of $B_{i}$ is a one-dimensional integral, which is approximated by numerical integration using the R function “adaptIntegrate”. Overall, the Bayes estimates under the multi-goal prior are closer to the classical estimates (MGF) than those under the superharmonic prior.
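The one-dimensional integral mentioned above can be sketched as follows. This is not the paper's R code. It is a rough Python illustration that assumes a scalar covariate, uses the fact that integrating $\beta$ out under a flat prior leaves a posterior for $A$ proportional to $\pi(A)L_{RE}(A)$, and replaces adaptive integration by a trapezoidal rule on a truncated range `[0, upper]`. All function names and default values are hypothetical.

```python
import math


def residual_loglik(A, y, x, D):
    """Residual log-likelihood of A (up to a constant), scalar covariate."""
    w = [1.0 / (A + d) for d in D]
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    syy = sum(wi * yi * yi for wi, yi in zip(w, y))
    return (0.5 * sum(math.log(wi) for wi in w)
            - 0.5 * math.log(sxx)
            - 0.5 * (syy - sxy * sxy / sxx))


def posterior_mean_shrinkage(y, x, D, i, log_prior, upper=200.0, n=2000):
    """Approximate E[B_i | y] by the trapezoidal rule on [0, upper]."""
    step = upper / n
    grid = [k * step for k in range(n + 1)]
    logs = [log_prior(A) + residual_loglik(A, y, x, D) for A in grid]
    peak = max(logs)  # subtract the maximum before exponentiating
    num = den = 0.0
    for k, (A, lg) in enumerate(zip(grid, logs)):
        weight = 0.5 if k in (0, n) else 1.0  # trapezoid end-point weights
        dens = weight * math.exp(lg - peak)
        num += dens * D[i] / (A + D[i])  # integrand B_i(A) * posterior
        den += dens
    return num / den  # the common step size cancels in the ratio


if __name__ == "__main__":
    y, x, D = [0.0, 2.0, 4.0, 6.0, 8.0], [1.0] * 5, [1.0, 2.0, 1.0, 3.0, 1.0]

    def log_mgp(A, i=0):
        # log of (A + D_i)^2 * tr[V^{-2}], the multi-goal prior for area i
        return 2.0 * math.log(A + D[i]) + math.log(sum((A + d) ** -2 for d in D))

    print(posterior_mean_shrinkage(y, x, D, 0, log_prior=lambda A: 0.0))
    print(posterior_mean_shrinkage(y, x, D, 0, log_prior=log_mgp))
```

The truncation point and grid size control the accuracy; an adaptive quadrature routine, as used in the paper, would be preferable in practice.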

Next, in Fig 2, we compare the mean squared error estimates by Taylor series (MGF) and parametric bootstrap (PB MG) of Hirose and Lahiri (2018) with the posterior variances under the two different priors. The parametric bootstrap mean squared error estimates use $10^{4}$ bootstrap samples. The two mean squared error estimates are virtually identical. Again, the posterior variances under the multi-goal prior are much closer to the mean squared error estimates than the corresponding posterior variances under the superharmonic prior.

5 Discussion

Can we extend our results to a general linear mixed model? To answer this question, we consider the following nested error regression model of Battese et al. (1988):

[TABLE]

where $\{v_{1},\ldots,v_{m}\}$ and $\{e_{1},\ldots,e_{m}\}$ are independent with $v_{i}{\sim}N(0,\sigma_{v}^{2})$ and $e_{i}{\sim}N(0,\sigma_{e}^{2})$; $x_{ij}$ is a $p$-dimensional vector of known auxiliary variables; $\beta\in\mathcal{R}^{p}$ is a $p$-dimensional vector of unknown regression coefficients; $\psi=(\sigma_{v}^{2},\sigma_{e}^{2})^{\prime}$ is an unknown variance component vector; and $n_{i}$ is the number of observed units in the $i$-th area.
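As a brief sketch of how this model connects to the two-level model of the earlier sections, assume the standard unit-level form $y_{ij}=x_{ij}^{\prime}\beta+v_{i}+e_{ij}$ with unit-level errors $e_{ij}$ independent across units (this form, and the bar notation for area means, are assumptions made here for illustration). Averaging over the $n_{i}$ units in area $i$ gives

$$\bar{y}_{i}=\bar{x}_{i}^{\prime}\beta+v_{i}+\bar{e}_{i},\qquad \bar{e}_{i}\sim N\left(0,\,\sigma_{e}^{2}/n_{i}\right),$$

so $\sigma_{v}^{2}$ plays the role of $A$ and $\sigma_{e}^{2}/n_{i}$ the role of $D_{i}$. The usual shrinkage factor $D_{i}/(A+D_{i})$ then becomes

$$\frac{\sigma_{e}^{2}/n_{i}}{\sigma_{v}^{2}+\sigma_{e}^{2}/n_{i}}=\frac{\sigma_{e}^{2}}{n_{i}\sigma_{v}^{2}+\sigma_{e}^{2}}.$$

The key difference from the two-level model is that the sampling variance is no longer known, since $\sigma_{e}^{2}$ must be estimated along with $\sigma_{v}^{2}$.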

To achieve desired property 1 given in Section 3, we need to solve the following system of differential equations with shrinkage factor $B_{i}=\sigma_{e}^{2}/(n_{i}\sigma_{v}^{2}+\sigma_{e}^{2})$, under certain regularity conditions:

[TABLE]

where

[TABLE]

If we use the following adjustment factor $h_{i;G}(\psi)$ for achieving desired property 1:

[TABLE]

for a given two-dimensional fixed vector ${k}$, the solution of $v$ can be obtained as

[TABLE]

This solution thus leads to an appropriate adjustment factor satisfying

[TABLE]

Thus, there exist multiple solutions for $h_{i;G}(\psi)$ satisfying desired property 1 under the nested error regression model (10). Further research is needed to identify a reasonable adjustment factor for the general linear mixed model and to establish a connection with the corresponding Bayesian approach.

Acknowledgements

The first and second authors’ research was supported by JSPS KAKENHI Grant Number 18K12758 and U.S. National Science Foundation Grant SES-1534413, respectively.

Appendix A Appendix

Throughout this paper, we assume the following regularity conditions:

Regularity Conditions

R1: $\mbox{rank}(X)=p$ is bounded for large $m$ ;

R2: The elements of $X$ are uniformly bounded implying $\sup_{j\geq 1}x_{j}(X^{\prime}X)^{-1}x_{j}=O(m^{-1})$ ;

R3: $0<\inf_{i\geq 1}D_{i}\leq\sup_{i\geq 1}D_{i}<\infty$ , $A\in(0,\infty)$ ;

R4: $|\hat{A}_{i}|<C_{ad}m^{\lambda}$, where $\hat{A}_{i}$ is an estimator of $A$, $C_{ad}$ is a generic positive constant, and $\lambda$ is a small positive constant.

We also restrict the class of adjustment factors $h_{+}(A)$ and $h_{i;G}(A)$ to those that satisfy the following regularity conditions, as in Hirose and Lahiri (2018):

R5: $\log h_{i;G}(A)$ is free of $y$ and four times continuously differentiable with respect to $A$. Moreover, $\frac{\partial^{k}\log h_{i;G}(A)}{\partial A^{k}}$ is of order $O(1)$ for large $m$, for each $k=0,1,2,3$;

R6: $\log h_{+}(A)$ is free of $y$ and four times continuously differentiable with respect to $A$ . Moreover, $\frac{\partial^{k}\log h_{+}(A)}{\partial A^{k}}$ is of order $o(1)$ , for large $m$ with $k=0,1,2,3$ ;

R7: $h_{+}(A)$ is strictly positive on $A>0$, satisfying $h_{+}(A)\Big{|}_{A=0}=0$ and $h_{+}(A)<C$ on $A>0$ with a generic positive constant $C$.
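Condition R2 can be checked numerically for a given design. The sketch below is only an illustration under a hypothetical design with rows $x_{j}=(1,z_{j})$ and bounded $z_{j}=\sin(j)$ (not a design used in the paper): it computes the largest value of $x_{j}^{\prime}(X^{\prime}X)^{-1}x_{j}$ and reports it scaled by $m$, which should stay bounded as $m$ grows if the $O(m^{-1})$ rate holds.

```python
import math


def leverage_summary(z):
    """Return (max_j x_j'(X'X)^{-1}x_j, sum_j x_j'(X'X)^{-1}x_j) for rows x_j = (1, z_j)."""
    m = len(z)
    s1 = sum(z)
    s2 = sum(v * v for v in z)
    det = m * s2 - s1 * s1                    # determinant of the 2x2 matrix X'X
    # (X'X)^{-1} = (1/det) * [[s2, -s1], [-s1, m]], so the quadratic form is:
    lev = [(s2 - 2.0 * s1 * v + m * v * v) / det for v in z]
    return max(lev), sum(lev)


for m in (50, 500, 5000):
    z = [math.sin(j) for j in range(1, m + 1)]   # hypothetical bounded covariate
    top, total = leverage_summary(z)
    # m * top should remain bounded; total equals rank(X) = 2 up to rounding
    print(m, m * top, total)
```

The second returned value is the trace of the projection matrix, which equals $\mbox{rank}(X)=p$ and serves as a quick sanity check on the computation.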

A.1 Proof of Theorem 1

The result follows from an argument similar to the ones given in Das et al. (2004). We note that for the general adjusted maximum likelihood method (1),

[TABLE]

where $l_{i;G}^{(k)}(A)=\frac{\partial^{k}[\tilde{l}_{i;G}(A)+{l}_{RE}(A)]}{\partial A^{k}}$ for $k=1,2,3$ with $\tilde{l}_{i;G}(A)=\log h_{i;G}(A)$ and ${l}_{RE}(A)=\log L_{RE}(A)$. In addition, ${A}_{i}^{*}$ lies between $A$ and $\hat{A}_{i;G}$.

Under regularity conditions, using results of Hirose and Lahiri (2018) and $l_{i;G}^{(1)}(\hat{A}_{i;G})=0$, we have $\hat{A}_{i;G}-A=O_{p}(m^{-1/2})$, ${A}_{i}^{*}-A=O_{p}(m^{-1/2})$, $l_{RE}^{(1)}(\hat{A}_{i;G})=-\tilde{l}_{i;G}^{(1)}(\hat{A}_{i;G})$, $E[l_{i;G}^{(2)}(A)]=E[l_{RE}^{(2)}(A)]+O(1)=-\frac{tr[V^{-2}]}{2}+O(1)$, $|l_{RE}^{(2)}({A})|=O_{p}(m)$, and $|l_{RE}^{(3)}(A)|=O_{p}(m)$.

Hence, (13) yields:

[TABLE]

Using the fact that $l_{RE}^{(1)}({A})=o_{p}(m)$ ,

[TABLE]

Theorem 1 thus follows.

A.2 Proof of Theorem 2

Proof of part (i):

Using Theorem 1, we have

[TABLE]

Hence, using (5) given in (21) of Datta et al. (2005), equation (2) implies that the following condition is required in order to satisfy $\hat{B}_{i}^{HB}=\hat{B}_{i}(\hat{A}_{i;G})$ :

[TABLE]

Equation (16) reduces to:

[TABLE]

After solving the above differential equation, up to the order of $O_{p}(m^{-1})$ , we obtain: $\pi_{i;G}(A)\propto h_{i;G}(A)(A+D_{i})tr[V^{-2}].$

Part (i) follows from this result. ∎
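To illustrate the form of this prior, consider the balanced case $D_{j}=D$ for all $j$, taking $V=\mbox{diag}(A+D_{1},\ldots,A+D_{m})$ as in the two-level model. This is a worked special case for orientation only, not an additional result:

$$tr[V^{-2}]=\sum_{j=1}^{m}(A+D_{j})^{-2}=\frac{m}{(A+D)^{2}},\qquad \pi_{i;G}(A)\propto h_{i;G}(A)\,(A+D)\,\frac{m}{(A+D)^{2}}\propto\frac{h_{i;G}(A)}{A+D}.$$

In the balanced case, then, the factor multiplying the adjustment factor $h_{i;G}(A)$ decays like $A^{-1}$ for large $A$ and no longer depends on the area index $i$.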

Proof of part (ii): Under regularity conditions, Hirose and Lahiri (2018) proved the following result:

[TABLE]

Hence, using the result of Datta et al. (2005),

[TABLE]

Thus, the prior (2) satisfies property (ii) from (18).

∎

Proof of part (iii):

Datta et al. (2005) obtained the following result:

[TABLE]

where

[TABLE]

Using (16), we obtain

[TABLE]

Hence, using Theorem 1, (15), (19), (21) and the fact that $\partial\hat{\beta}(A)/\partial A=O_{p}(m^{-1/2})$ , we have, for large $m$ ,

[TABLE]

This completes the proof of part (iii). ∎

Appendix B Appendix

B.1 Proof of Remark 1 (c)

We show that if we use $h_{i;G}(A)$ alone as a prior, $h_{i;G}(A)=o(A^{(m-p-2)/2})$ is a sufficient condition for the propriety of the posterior in the constrained class of adjustment factors $h_{i;G}(A)=(A+D_{i})^{s}$ for some $s>0$ and fixed $m$. We note that

[TABLE]

It is evident that the condition $s<{(m-p-2)}/{2}$ ensures that the bound in the preceding display is finite. Thus, the condition $h_{i;G}(A)=o(A^{(m-p-2)/2})$ is a sufficient condition for $h_{i;G}(A)$ to be a bona fide prior for large $A$.

The following inequality shows that $\pi_{i;G}(A)$ could be a prior if the condition $h_{i;G}(A)=o(A^{(m-p)/2})$ is met.

[TABLE]

Hence, if $h_{i;G}(A)$ in $\pi_{i;G}(A)$ satisfies $s<{(m-p)}/{2}$, then the bound in the preceding display is finite. Thus, the condition $h_{i;G}(A)=o(A^{(m-p)/2})$ is a sufficient condition for $\pi_{i;G}(A)$ to be a bona fide prior in a Bayesian method, as well as an adjustment factor in an adjusted maximum likelihood method.
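The two thresholds can be traced to the tail behavior of the integrands. The following is a heuristic sketch only, assuming $V=\mbox{diag}(A+D_{1},\ldots,A+D_{m})$, $X$ of full rank $p$, and $P=V^{-1}-V^{-1}X(X^{\prime}V^{-1}X)^{-1}X^{\prime}V^{-1}$, and looking only at large $A$ (near $A=0$ the integrands stay bounded because each $D_{j}>0$):

$$L_{RE}(A)\propto|V|^{-1/2}\,|X^{\prime}V^{-1}X|^{-1/2}\exp\left(-\tfrac{1}{2}y^{\prime}Py\right)\asymp A^{-m/2}\cdot A^{p/2}\cdot 1=A^{-(m-p)/2}\quad(A\to\infty).$$

With $h_{i;G}(A)=(A+D_{i})^{s}$ used alone as a prior,

$$h_{i;G}(A)L_{RE}(A)\asymp A^{s-(m-p)/2},$$

which is integrable at infinity exactly when $s-(m-p)/2<-1$, that is, $s<(m-p-2)/2$. For $\pi_{i;G}(A)\propto h_{i;G}(A)(A+D_{i})tr[V^{-2}]$, the extra factor satisfies $(A+D_{i})tr[V^{-2}]\asymp mA^{-1}$, so

$$\pi_{i;G}(A)L_{RE}(A)\asymp A^{s-1-(m-p)/2},$$

which is integrable at infinity exactly when $s<(m-p)/2$. The additional $A^{-1}$ decay is what relaxes the condition by one power of $A$.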

B.2 Proof of Remark 2 (a)

We show that the prior (9) achieves (ii)-(v).

Proof of (ii):

From the result of Datta et al. (2005) and Hirose and Lahiri (2018),

[TABLE]

Hence, the prior achieves property (ii) by (24). ∎

Proof of (iii):

Using (4), it is straightforward to show:

[TABLE]

Using (6) and (20), we obtain the following after some algebra:

[TABLE]

Using (19), Corollary 1 (ii) and (25), we get:

[TABLE]

Property (iii) thus follows from the result (26). ∎

Proof of (iv)-(v):

Using (25), we get

[TABLE]

Datta et al. (2005) obtained the following results:

[TABLE]

Using the result given in Butar and Lahiri (2003), Hirose and Lahiri (2018), (25) and (27), we get

[TABLE]

Equation (29) implies that the prior (9) also satisfies (iv)-(v) simultaneously. ∎

Bibliography

Battese, G. E., Harter, R. M., and Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28-36.
Bell, W. R., and Franco, C. (2017). Small Area Estimation-State Poverty Rate Model Research Data Files. Available at https://www.census.gov/srd/csrmreports/byyear.html [accessed October 22, 2018].
Bell, W. R., Basel, W. W., and Maples, J. J. (2016). An Overview of the U.S. Census Bureau’s Small Area Income and Poverty Estimates Program. In M. Pratesi (Ed.), Analysis of Poverty Data by Small Area Estimation (pp. 349-377). West Sussex: Wiley & Sons, Inc.
Butar, F. B. and Lahiri, P. (2003). On measures of uncertainty of empirical Bayes small-area estimators. J. Statist. Plann. Inference, 112, 63-76.
Carter, G. M. and Rolph, J. E. (1974). Empirical Bayes methods applied to estimating fire alarm probabilities. J. Amer. Statist. Assoc., 69, 880-885.
