Marginal and Conditional Multiple Inference for Linear Mixed Model Predictors

Peter Kramlinger; Tatyana Krivobokova; Stefan Sperlich

arXiv:1812.09250·math.ST·February 25, 2022

TL;DR

This paper develops a framework for multiple inference on cluster-specific predictors in linear mixed models. It provides asymptotically valid confidence sets under both the marginal and the conditional law, validated in simulations and illustrated with a study of Covid-19 mortality in US state prisons.

Contribution

It introduces a novel approach for cluster-specific multiple inference in linear mixed models, including confidence sets valid under both marginal and conditional laws.

Findings

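01 Joint confidence sets for the cluster-specific predictors are constructed under both the marginal and the conditional law, each with coverage error of order $O(m^{-1/2})$.

02 Marginal confidence sets are also asymptotically valid for conditional inference.

03 Linear hypotheses can be tested with standard quantiles, without re-sampling techniques.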

Tables

Table 1: Coverage of 95%-confidence ellipsoids in model (12) under the conditional law. The size relative to the marginal REML-based sets is given in brackets.

Variance components	$m$	$n_{i}$	Marginal, known $\boldsymbol{\delta}^{v}$	Conditional, known $\lambda$, $\boldsymbol{\delta}^{v}$	Conditional, known $\boldsymbol{\delta}^{v}$	Marginal, REML	Conditional, REML
$σ_{v}^{2} = 0.8$ $σ_{e}^{2} = 0.4$	5	5	0.929 (1)	0.929 (1.00)	0.928 (1.01)	0.895 (1)	0.872 (0.82)
	50	5	0.943 (1)	0.943 (0.97)	0.946 (1.26)	0.921 (1)	0.917 (0.88)
	5	10	0.940 (1)	0.940 (1.00)	0.940 (1.00)	0.922 (1)	0.916 (0.94)
	50	10	0.948 (1)	0.948 (1.00)	0.948 (1.03)	0.938 (1)	0.937 (0.97)
	10	50	0.947 (1)	0.947 (1.00)	0.947 (1.00)	0.944 (1)	0.944 (1.00)
$σ_{v}^{2} = 0.6$ $σ_{e}^{2} = 0.6$	5	5	0.926 (1)	0.927 (1.00)	0.923 (1.04)	0.895 (1)	0.840 (0.67)
	50	5	0.944 (1)	0.943 (0.92)	0.955 (2.56)	0.919 (1)	0.924 (1.13)
	5	10	0.938 (1)	0.939 (1.00)	0.938 (1.01)	0.920 (1)	0.906 (0.87)
	50	10	0.948 (1)	0.948 (0.99)	0.950 (1.12)	0.938 (1)	0.938 (0.98)
	10	50	0.947 (1)	0.946 (1.00)	0.947 (1.00)	0.944 (1)	0.944 (1.00)
$σ_{v}^{2} = 0.4$ $σ_{e}^{2} = 0.8$	5	5	0.921 (1)	0.929 (1.02)	0.921 (1.16)	0.894 (1)	0.766 (0.46)
	50	5	0.945 (1)	0.942 (0.79)	0.975 (31.0)	0.900 (1)	0.949 (14.2)
	5	10	0.935 (1)	0.942 (1.01)	0.936 (1.06)	0.915 (1)	0.881 (0.76)
	50	10	0.947 (1)	0.947 (0.96)	0.952 (1.57)	0.936 (1)	0.940 (1.14)
	10	50	0.947 (1)	0.947 (1.00)	0.947 (1.00)	0.944 (1)	0.944 (1.00)

Table 2: Tests for the equality of state means of groups by governor.

Governor	$u + 1$	Marginal pivot	Marginal $\chi_{u + 1, 0.95}^{2}$	Marginal p-value	Conditional pivot	Conditional $\chi_{u + 1, 0.95}^{2}(\hat{\lambda}_{\mathbf{L}})$	$\hat{\lambda}_{\mathbf{L}}$	Conditional p-value
Democrat	$23$	$77$	$34$	$5 \times 10^{- 8}$	$104$	$87$	$40$	$5 \times 10^{- 3}$
Republican	$22$	$141$	$33$	$8 \times 10^{- 20}$	$172$	$35$	$1$	$6 \times 10^{- 24}$

Table 3: Tests for the equality of state means of groups by census regions.

Census region	$u + 1$	Marginal pivot	Marginal $\chi_{u + 1, 0.95}^{2}$	Marginal p-value	Conditional pivot	Conditional $\chi_{u + 1, 0.95}^{2}(\hat{\lambda}_{\mathbf{L}})$	$\hat{\lambda}_{\mathbf{L}}$	Conditional p-value
Midwest	$10$	$45$	$17$	$8 \times 10^{- 7}$	$53$	$17$	$0.3$	$5 \times 10^{- 8}$
Northeast	$8$	$8$	$14$	$0.33$	$19$	$71$	$41$	$0.99$
South	$16$	$131$	$25$	$1 \times 10^{- 20}$	$147$	$25$	$0$	$8 \times 10^{- 24}$
West	$10$	$19$	$17$	$2 \times 10^{- 2}$	$23$	$17$	$0$	$5 \times 10^{- 3}$

Equations

\mathbf{y}_{i}=\mathbf{X}_{i}\boldsymbol{\beta}+\mathbf{Z}_{i}\mathbf{v}_{i}+\mathbf{e}_{i},\quad i=1,\dots,m,\qquad \mathbf{e}_{i}\sim N_{n_{i}}\{\mathbf{0}_{n_{i}},\mathbf{R}_{i}(\boldsymbol{\delta})\},\quad \mathbf{v}_{i}\sim N_{q}\{\mathbf{0}_{q},\mathbf{G}(\boldsymbol{\delta})\},

\begin{gathered}\tilde{\mu}_{i}=\tilde{\mu}_{i}\big\{\boldsymbol{\delta},\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})\big\}=\mathbf{l}_{i}^{t}\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})+\mathbf{b}_{i}(\boldsymbol{\delta})^{t}\big\{\mathbf{y}_{i}-\mathbf{X}_{i}\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})\big\};\\ \mbox{where }\ \mathbf{b}_{i}(\boldsymbol{\delta})^{t}=\mathbf{h}_{i}^{t}\mathbf{G}(\boldsymbol{\delta})\mathbf{Z}_{i}^{t}\mathbf{V}_{i}(\boldsymbol{\delta})^{-1},\ \mbox{ and}\\ \hat{\boldsymbol{\beta}}(\boldsymbol{\delta})=\Big\{\sum_{i=1}^{m}\mathbf{X}_{i}^{t}\mathbf{V}_{i}(\boldsymbol{\delta})^{-1}\mathbf{X}_{i}\Big\}^{-1}\sum_{i=1}^{m}\mathbf{X}_{i}^{t}\mathbf{V}_{i}(\boldsymbol{\delta})^{-1}\mathbf{y}_{i}.\end{gathered}

\hat{\mu}_{i}=\tilde{\mu}_{i}\big\{\hat{\boldsymbol{\delta}},\hat{\boldsymbol{\beta}}(\hat{\boldsymbol{\delta}})\big\}.

\boldsymbol{\Sigma}

\mathbf{K}_{1}(\boldsymbol{\delta})

\mathbf{K}_{2}(\boldsymbol{\delta})

\mathbf{K}_{3}(\boldsymbol{\delta})

\boldsymbol{\Sigma}(\hat{\boldsymbol{\delta}})

\mathbf{K}_{3}(\hat{\boldsymbol{\delta}})

\mbox{E}\big(\widehat{\boldsymbol{\Sigma}}\big)=\boldsymbol{\Sigma}+\big\{O\big(m^{-3/2}\big)\big\}_{m\times m}.

\mbox{P}\Big\{\big\|\widehat{\boldsymbol{\Sigma}}^{-1/2}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})\big\|^{2}<\chi^{2}_{m,1-\alpha}\Big\}=1-\alpha+O(m^{-1/2}),

\mathcal{M}_{\alpha}=\Big\{\boldsymbol{\mu}\in\mathbb{R}^{m}:\big\|\widehat{\boldsymbol{\Sigma}}^{-1/2}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})\big\|^{2}\leq\chi^{2}_{m,1-\alpha}\Big\},

\boldsymbol{\Sigma}_{v}=\mathbf{L}_{1}(\boldsymbol{\delta}^{v})+\mathbf{L}_{2}(\boldsymbol{\delta}^{v})+\mathbf{L}_{3}(\boldsymbol{\delta}^{v})+\mathbf{L}_{4}(\boldsymbol{\delta}^{v});

\mathbf{L}_{1}(\boldsymbol{\delta}^{v})

\mathbf{L}_{2}(\boldsymbol{\delta}^{v})

\mathbf{L}_{3}(\boldsymbol{\delta}^{v})

\mathbf{L}_{4}(\boldsymbol{\delta}^{v})

\boldsymbol{\Sigma}_{v}(\hat{\boldsymbol{\delta}})

\mathbf{L}_{4}(\boldsymbol{\delta}^{v})

\mathbf{L}_{5}(\boldsymbol{\delta}^{v})

\mbox{E}\big(\widehat{\boldsymbol{\Sigma}}_{v}\big|\mathbf{v}\big)=\boldsymbol{\Sigma}_{v}+\big\{O\big(m^{-3/2}\big)\big\}_{m\times m}.

\mathbf{a}_{i}=(\mathbf{b}_{i}^{t}\mathbf{Z}_{i}-\mathbf{h}_{i}^{t})\mathbf{J}_{i}(\mathbf{Z}^{t}\mathbf{Z})^{-1}\mathbf{Z}^{t}+\mathbf{d}_{i}^{t}(\mathbf{X}^{t}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{t}\mathbf{V}^{-1},

\hat{\lambda}=\max\left[0,\tilde{\lambda}\big\{\widehat{\boldsymbol{\Sigma}}_{v}(\hat{\boldsymbol{\delta}}),\hat{\boldsymbol{\beta}},\hat{\boldsymbol{\delta}}\big\}\right],

\tilde{\lambda}\big(\boldsymbol{\Sigma}_{v},\boldsymbol{\beta},\boldsymbol{\delta}^{v}\big)=\big\|\boldsymbol{\Sigma}_{v}^{-1/2}\mathbf{A}(\boldsymbol{\delta}^{v})\mathbf{y}\big\|^{2}-\big\|\boldsymbol{\Sigma}_{v}^{-1/2}\mathbf{A}(\boldsymbol{\delta}^{v})\mathbf{R}(\boldsymbol{\delta}^{v})^{1/2}\big\|^{2}-\big\|\boldsymbol{\Sigma}_{v}^{-1/2}\mathbf{A}(\boldsymbol{\delta}^{v})\mathbf{X}\boldsymbol{\beta}\big\|^{2}.

\mbox{P}\Big\{\big\|\widehat{\boldsymbol{\Sigma}}_{v}^{-1/2}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})\big\|^{2}<\chi^{2}_{m,1-\alpha}(\hat{\lambda})\Big|\mathbf{v}\Big\}=1-\alpha+O(m^{-1/2}),

\mathcal{C}_{\alpha}=\Big\{\boldsymbol{\mu}\in\mathbb{R}^{m}:\big\|\widehat{\boldsymbol{\Sigma}}_{v}^{-1/2}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})\big\|^{2}\leq\chi^{2}_{m,1-\alpha}(\hat{\lambda})\Big\},

\mbox{P}\Big\{\big\|\widehat{\boldsymbol{\Sigma}}^{-1/2}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})\big\|^{2}<\chi^{2}_{m,1-\alpha}\Big|\mathbf{v}\Big\}=1-\alpha+O(m^{-1/2}).

\mbox{P}\Big\{\frac{(\hat{\mu}_{i}-\mu_{i})^{2}}{\hat{\sigma}_{ii}}<\chi^{2}_{1,1-\alpha}\Big|\mathbf{v}\Big\}=1-\alpha+O\left(n_{i}^{-1/2}\right),

H_{0}:\mathbf{L}(\boldsymbol{\mu}-\mathbf{a})=\mathbf{0}_{u}\quad\mbox{vs.}\quad H_{1}:\mathbf{L}(\boldsymbol{\mu}-\mathbf{a})\neq\mathbf{0}_{u},

\mathcal{C}_{\alpha,\mathbf{L}}=\left\{\mathbf{a}\in\mathbb{R}^{m}:\big\|\big(\mathbf{L}\widehat{\boldsymbol{\Sigma}}_{v}\mathbf{L}^{t}\big)^{-1/2}\mathbf{L}(\hat{\boldsymbol{\mu}}-\mathbf{a})\big\|^{2}\leq\chi^{2}_{u,1-\alpha}(\hat{\lambda}_{\mathbf{L}})\right\}.

\mathcal{M}_{\alpha,\mathbf{L}}=\left\{\mathbf{a}\in\mathbb{R}^{m}:\big\|\big(\mathbf{L}\widehat{\boldsymbol{\Sigma}}\mathbf{L}^{t}\big)^{-1/2}\mathbf{L}(\hat{\boldsymbol{\mu}}-\mathbf{a})\big\|^{2}\leq\chi^{2}_{u,1-\alpha}\right\}.

y_{ij}=\beta_{0}+x_{ij}\beta_{1}+v_{i}+e_{ij},\quad i=1,\dots,m,\quad j=1,\dots,n_{i}.

\frac{1}{m}\sum_{i=1}^{m}\mbox{P}(|T_{i}|\leq z_{1-\alpha/2}|\mathbf{v})=1-\alpha+O\big(m^{-1/2}\big).

\gamma_{i}=\frac{\sigma_{v}^{2}}{\sigma_{v}^{2}+\sigma_{e}^{2}/n_{i}},

\big\|\boldsymbol{\Sigma}^{-1/2}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})\big\|^{2}

Code & Models

Repositories

kramlinger/Marginal-and-Conditional-Multiple-Inference-for-Linear-Mixed-Model-Predictors (official)

Full text

Marginal and Conditional Multiple Inference for Linear Mixed Model Predictors

Peter Kramlinger

Department of Statistics and Operations Research, Universität Wien, Oskar-Morgenstern-Platz 1, 1090 Wien, Austria

Tatyana Krivobokova

Department of Statistics and Operations Research, Universität Wien, Oskar-Morgenstern-Platz 1, 1090 Wien, Austria

Stefan Sperlich

Geneva School of Economics and Management, Université de Genève, 40 Bd du Pont d’Arve, 1211 Genève 4, Switzerland

Abstract

In spite of its high practical relevance, cluster-specific multiple inference for linear mixed model predictors has hardly been addressed so far. While marginal inference for population parameters is well understood, conditional inference for the cluster-specific predictors is more intricate. This work introduces a general framework for multiple inference in linear mixed models for cluster-specific predictors. Consistent confidence sets for multiple inference are constructed under both the marginal and the conditional law. Furthermore, it is shown that, remarkably, the corresponding multiple marginal confidence sets are also asymptotically valid for conditional inference. These sets lend themselves to testing linear hypotheses using standard quantiles without the need for re-sampling techniques. All findings are validated in simulations and illustrated with a study on Covid-19 mortality in US state prisons.

Keywords and phrases. Simultaneous inference, multiple testing, mixed parameters, linear mixed models, small area estimation.

1 Introduction

Linear mixed models (LMMs) were introduced by Henderson in the 1950s (Henderson, 1950, 1953) and are applied if repeated measurements on several independent clusters of interest are available. The classical LMM, allowing for random intercepts and slopes, is

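$$\mathbf{y}_{i}=\mathbf{X}_{i}\boldsymbol{\beta}+\mathbf{Z}_{i}\mathbf{v}_{i}+\mathbf{e}_{i},\quad i=1,\dots,m,\qquad \mathbf{e}_{i}\sim N_{n_{i}}\{\mathbf{0}_{n_{i}},\mathbf{R}_{i}(\boldsymbol{\delta})\},\quad \mathbf{v}_{i}\sim N_{q}\{\mathbf{0}_{q},\mathbf{G}(\boldsymbol{\delta})\},\tag{1}$$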

with observations $\mathbf{y}_{i}\in\mathbb{R}^{n_{i}}$, known $\mathbf{X}_{i}\in\mathbb{R}^{n_{i}\times p}$ with $\text{rank}\{(\mathbf{X}_{1}^{t},\dots,\mathbf{X}_{m}^{t})^{t}\}=p$ for $p\in\mathbb{N}$ fixed and $\mathbf{Z}_{i}\in\mathbb{R}^{n_{i}\times q}$, $q\in\mathbb{N}$ fixed, independent random effects $\mathbf{v}_{i}\in\mathbb{R}^{q}$, and error terms $\mathbf{e}_{i}\in\mathbb{R}^{n_{i}}$, such that $\mbox{Cov}(\mathbf{e}_{i},\mathbf{v}_{i})=\boldsymbol{0}_{n_{i}\times q}$. Parameters $\boldsymbol{\beta}\in\mathbb{R}^{p}$ and $\boldsymbol{\delta}\in\mathbb{R}^{r}$, $r\in\mathbb{N}$ fixed, are unknown and we denote $\mathbf{V}_{i}(\boldsymbol{\delta})=\mbox{Cov}(\mathbf{y}_{i})=\mathbf{R}_{i}(\boldsymbol{\delta})+\mathbf{Z}_{i}\mathbf{G}(\boldsymbol{\delta})\mathbf{Z}_{i}^{t}$, where $\mathbf{R}_{i}(\boldsymbol{\delta})$ and $\mathbf{G}(\boldsymbol{\delta})$ are known up to $\boldsymbol{\delta}$. LMM (1) includes the popular nested-error regression model, the random coefficient model, and the so-called Fay-Herriot model (Prasad and Rao, 1990).
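To make the notation concrete, the following is a minimal sketch that simulates data from a random-intercept special case of model (1) and assembles $\mathbf{V}_{i}=\mathbf{R}_{i}+\mathbf{Z}_{i}\mathbf{G}\mathbf{Z}_{i}^{t}$. The dimensions, the value of $\boldsymbol{\beta}$, the variance components and all variable names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: m clusters, n_i observations each, p fixed and q random effects.
m, n_i, p, q = 6, 4, 2, 1
beta = np.array([1.0, 0.5])          # assumed fixed effects
sigma_v2, sigma_e2 = 0.8, 0.4        # assumed delta = (sigma_v^2, sigma_e^2)

G = sigma_v2 * np.eye(q)             # G(delta): covariance of v_i
R = sigma_e2 * np.eye(n_i)           # R_i(delta): covariance of e_i

X_list, y_list, V_list = [], [], []
for _ in range(m):
    X = np.column_stack([np.ones(n_i), rng.normal(size=n_i)])  # X_i (n_i x p)
    Z = np.ones((n_i, q))                                       # Z_i: random intercept
    v = rng.multivariate_normal(np.zeros(q), G)                 # v_i ~ N_q(0, G)
    e = rng.multivariate_normal(np.zeros(n_i), R)               # e_i ~ N(0, R_i)
    X_list.append(X)
    y_list.append(X @ beta + Z @ v + e)                         # y_i = X_i beta + Z_i v_i + e_i
    V_list.append(R + Z @ G @ Z.T)                              # V_i = R_i + Z_i G Z_i^t
```

In this special case each $\mathbf{V}_{i}$ has $\sigma_{v}^{2}+\sigma_{e}^{2}$ on the diagonal and $\sigma_{v}^{2}$ elsewhere, which is the compound-symmetry structure of the nested-error regression model.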

Today, LMMs are widely applied across the sciences (Tuerlinckx et al., 2006; Jiang, 2007). Clusters $i=1,\dots,m$ refer, for instance, to subjects or groups, as in biometrics with longitudinal data (Laird and Ware, 1982; Liang and Zeger, 1986; Verbeke and Molenberghs, 2000), to treatment levels in medicine (Francq et al., 2019), or to areas, as in small area estimation (SAE), to mention only some prominent application domains. For the latter, see Tzavidis et al. (2018) for a recent review, and Pratesi and Salvati (2008) for examples with spatio-temporal modeling of $\mathbf{G}(\boldsymbol{\delta})$ and $\mathbf{R}_{i}(\boldsymbol{\delta})$.

Depending on the research question, the inference focus may be either on the population parameter $\boldsymbol{\beta}$ or on cluster-specific characteristics, which are associated with the random effects $\mathbf{v}_{i}$. In the former case, LMM (1) can simply be interpreted as a linear regression model with mean $\mathbf{X}_{i}\boldsymbol{\beta}$ and covariance matrix $\mathbf{V}_{i}(\boldsymbol{\delta})$ that accounts for complex dependence in the data. Inference about $\boldsymbol{\beta}$, where the $\mathbf{v}_{i}$ in (1) are treated as random, is called marginal and is well understood.

Often the interest lies rather in mixed parameters, that is, linear combinations of $\boldsymbol{\beta}$ and $\mathbf{v}_{i}$, such as $\mu_{i}=\mathbf{l}_{i}^{t}\boldsymbol{\beta}+\mathbf{h}_{i}^{t}\mathbf{v}_{i}$, $i=1,\ldots,m$, with known $\mathbf{l}_{i}\in\mathbb{R}^{p}$ and $\mathbf{h}_{i}\in\mathbb{R}^{q}$. In many situations, cf. Section 4.1 of Tzavidis et al. (2018), inference about $\mu_{i}$ with some realized random effects $\mathbf{v}_{i}$ should then be done conditional on those $\mathbf{v}_{i}$, i.e., the $\mathbf{v}_{i}$ are treated as fixed. The importance of this distinction between marginal and conditional inference in LMMs was already emphasized by Harville (1977), and it has attracted particular attention in model selection. Specifically, Vaida and Blanchard (2005) noted that the conventionally used marginal Akaike information criterion is applicable to the selection of $\boldsymbol{\beta}$ only, and suggested a conditional version for cluster-specific parameters. Recent contributions (You et al., 2016; Lombardía et al., 2017) have adopted this distinction and provided bootstrap procedures to accurately estimate the degrees of freedom in the conditional setting.

The focus on conditional inference is particularly meaningful if the cluster effects $\mathbf{v}_{i}$ are regarded as fixed in practice and random effects are merely a modeling device. Today, such an interpretation is common (Hodges, 2013). Even though it then seems more natural to employ fixed effects models, in many practical situations their estimators are inefficient, see e.g. Pfeffermann (2013). A reasonable approach to obtaining estimators of cluster-specific effects is then to employ model (1) as if the $\mathbf{v}_{i}$ were random and obtain a predictor for them. Yet, to perform inference as if the $\mathbf{v}_{i}$ were fixed, one needs to condition on the cluster, i.e., on $\mathbf{v}_{i}$. For example, the application in Section 4 studies mortality in US state prisons. The effects $\mathbf{v}_{i}$ in this example model the state-specific effects on Covid-19 mortality (e.g., due to state policy and/or population structure). Since too few observations per state are available, the state effect is predicted within the modeling framework of an LMM. Assume that one is interested in inference about the mean mortality in each state, that is, including the state effect $\mathbf{v}_{i}$. Since the state effect is not necessarily considered to be random in nature and the inference focus is on the state level, the corresponding inference should be conducted under the conditional law, that is, treating $\mathbf{v}_{i}$ as fixed. Therefore, the main focus of this work is on conditional inference on $\mu_{i}$. For more discussion on marginal versus conditional inference see also our Supplement, Section 6.1.

There is a large body of literature on constructing confidence intervals for each $\mu_{i}$ separately under the marginal law. Since, under the marginal law, estimators $\hat{\mu}_{i}$ of $\mu_{i}$ obtained from (1) are unbiased, much attention has been given to the estimation of the mean squared error $\mbox{MSE}(\hat{\mu}_{i})=\mbox{E}(\mu_{i}-\hat{\mu}_{i})^{2}=\mbox{Var}(\hat{\mu}_{i})$, where the expectation is taken under the marginal law, that is, treating $\mathbf{v}_{i}$ as random. To estimate this marginal MSE, one can either plug in an appropriate estimator of $\boldsymbol{\delta}$, or use unbiased marginal MSE approximations (Prasad and Rao, 1990; Datta and Lahiri, 2000; Das et al., 2004). Other approaches to the estimation of the marginal MSE comprise a diverse collection of bootstrap methods (González-Manteiga et al., 2008; Chatterjee et al., 2008). Conditional inference on a single $\mu_{i}$, that is, conditioning on $\mathbf{v}_{i}$, turns out to be infeasible due to the bias of $\hat{\mu}_{i}$ which arises under the conditional law. While estimation of the bias leads to unacceptably wide intervals (Datta et al., 2002; Jiang and Lahiri, 2006), ignoring it leads to strong under-coverage. This was mentioned as an open problem by Pfeffermann (2013).

Conditional and marginal inference about all $\mu_{1},\ldots,\mu_{m}$ simultaneously, or about a subset thereof, has been largely neglected. To the best of our knowledge, only Ganesh (2009) considered a related problem of Bayesian inference about certain linear combinations of $\mu_{i}$ in the Fay-Herriot model. Reluga et al. (2019) and Reluga et al. (2021) used max-type statistics to construct simultaneous intervals for mixed parameters $\mu_{i}$ of generalized LMMs under the marginal law. For a discussion of the average coverage of cluster-specific confidence intervals see Zhang (2007), and Section 3 for its relation to our method. None of these contributions considered multiple inference under the conditional law.

Altogether, there is a lack of results on multiple inference in linear mixed models and a tension between the marginal and the conditional focus in inference. In this work we address both issues. First, we construct confidence sets for $\mu_{1},\ldots,\mu_{m}$ in LMMs. Second, we consider those joint (or multiple) confidence sets under both the conditional and the marginal law. For the former we show that the nominal coverage is attained at the usual parametric rate. Then we show that, surprisingly, joint confidence sets built under the marginal law are accurate at the same parametric rate and are also approximately valid when conditioning on the clusters. This, however, is not true in general for the cluster-wise confidence intervals, i.e., for a single $\mu_{i}$. Next, we use the derived confidence sets to develop multiple tests for linear hypotheses, either on all $\mu_{1},\dots,\mu_{m}$ or on a subvector thereof. Finally, the practical use and relevance of the derived methods is illustrated in simulations and in a study on Covid-19 mortality in US state prisons.

The main results are given in Section 2 including applications for comparative statistics and testing linear hypotheses. The results are visualized via simulations in Section 3 and a practical application is given in Section 4. We conclude with a discussion in Section 5. Relevant proofs are deferred to the Appendix, while auxiliary proofs and additional results are provided in the Supplement.

2 Confidence Sets for Multiple Inference

Marginal Simultaneous Prediction Sets

We start by introducing further notation and assumptions; for a general monograph on LMMs and generalizations see e.g. Demidenko (2004). For model (1), under the marginal law, the best linear unbiased predictor (BLUP) of $\mu_{i}=\mathbf{l}_{i}^{t}\boldsymbol{\beta}+\mathbf{h}_{i}^{t}\mathbf{v}_{i}$ reads as

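$$\begin{gathered}\tilde{\mu}_{i}=\tilde{\mu}_{i}\big\{\boldsymbol{\delta},\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})\big\}=\mathbf{l}_{i}^{t}\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})+\mathbf{b}_{i}(\boldsymbol{\delta})^{t}\big\{\mathbf{y}_{i}-\mathbf{X}_{i}\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})\big\};\\ \mbox{where }\ \mathbf{b}_{i}(\boldsymbol{\delta})^{t}=\mathbf{h}_{i}^{t}\mathbf{G}(\boldsymbol{\delta})\mathbf{Z}_{i}^{t}\mathbf{V}_{i}(\boldsymbol{\delta})^{-1},\ \mbox{ and}\\ \hat{\boldsymbol{\beta}}(\boldsymbol{\delta})=\Big\{\sum_{i=1}^{m}\mathbf{X}_{i}^{t}\mathbf{V}_{i}(\boldsymbol{\delta})^{-1}\mathbf{X}_{i}\Big\}^{-1}\sum_{i=1}^{m}\mathbf{X}_{i}^{t}\mathbf{V}_{i}(\boldsymbol{\delta})^{-1}\mathbf{y}_{i}.\end{gathered}\tag{2}$$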

If the variance components $\boldsymbol{\delta}$ are unknown, they can be estimated using restricted maximum likelihood (REML) as given in (15) in the Supplement, or by Henderson III, as defined by Searle et al. (1992, Chapter 5). Replacing $\boldsymbol{\delta}$ in (2) by an estimator based on either of these methods gives the empirical BLUP (EBLUP)

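$$\hat{\mu}_{i}=\tilde{\mu}_{i}\big\{\hat{\boldsymbol{\delta}},\hat{\boldsymbol{\beta}}(\hat{\boldsymbol{\delta}})\big\}.\tag{3}$$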
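The BLUP can be sketched numerically as follows: a minimal illustration for a random-intercept model with known variance components, using the generalized least squares form of $\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})$ and $\mathbf{b}_{i}^{t}=\mathbf{h}_{i}^{t}\mathbf{G}\mathbf{Z}_{i}^{t}\mathbf{V}_{i}^{-1}$. The function names `gls_beta` and `blup`, the simulated data, and the choice $\mathbf{l}_{i}=$ cluster mean of the rows of $\mathbf{X}_{i}$ are assumptions made for illustration. Plugging an estimate $\hat{\boldsymbol{\delta}}$ (e.g. from REML, not implemented here) into the same functions would give the EBLUP.

```python
import numpy as np

def gls_beta(X_list, y_list, V_list):
    """beta_hat(delta) = {sum_i X_i^t V_i^{-1} X_i}^{-1} sum_i X_i^t V_i^{-1} y_i."""
    p = X_list[0].shape[1]
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for X, y, V in zip(X_list, y_list, V_list):
        Vinv_X = np.linalg.solve(V, X)      # V_i^{-1} X_i
        lhs += X.T @ Vinv_X
        rhs += Vinv_X.T @ y                 # X_i^t V_i^{-1} y_i (V_i symmetric)
    return np.linalg.solve(lhs, rhs)

def blup(l, h, G, Z, V, X, y, beta_hat):
    """mu_tilde_i = l^t beta_hat + b^t (y_i - X_i beta_hat), b^t = h^t G Z^t V^{-1}."""
    b = np.linalg.solve(V, Z @ G @ h)       # b_i as a vector (G and V symmetric)
    return l @ beta_hat + b @ (y - X @ beta_hat)

# Hypothetical toy data from a random-intercept model.
rng = np.random.default_rng(1)
m, n_i = 5, 4
sigma_v2, sigma_e2 = 0.8, 0.4               # treated as known here
G, Z, h = np.array([[sigma_v2]]), np.ones((n_i, 1)), np.array([1.0])
X_list, y_list, V_list = [], [], []
for _ in range(m):
    X = np.column_stack([np.ones(n_i), rng.normal(size=n_i)])
    v = rng.normal(scale=np.sqrt(sigma_v2))
    e = rng.normal(scale=np.sqrt(sigma_e2), size=n_i)
    X_list.append(X)
    y_list.append(X @ np.array([1.0, 0.5]) + v + e)
    V_list.append(sigma_e2 * np.eye(n_i) + Z @ G @ Z.T)

beta_hat = gls_beta(X_list, y_list, V_list)
mu_tilde = np.array([
    blup(X.mean(axis=0), h, G, Z, V, X, y, beta_hat)   # assumed l_i = (1, mean of x_ij)
    for X, y, V in zip(X_list, y_list, V_list)
])
```

For this model the predictor reduces to the familiar shrinkage form $\mathbf{l}_{i}^{t}\hat{\boldsymbol{\beta}}+\gamma_{i}(\bar{y}_{i}-\bar{\mathbf{x}}_{i}^{t}\hat{\boldsymbol{\beta}})$ with $\gamma_{i}=\sigma_{v}^{2}/(\sigma_{v}^{2}+\sigma_{e}^{2}/n_{i})$, which offers a quick consistency check on the matrix formula.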

Subsequently, the dependency on $\hat{\boldsymbol{\delta}}$ or $\boldsymbol{\delta}$ is suppressed if it is clear from the context. Consider the asymptotic scenario

(A1) $m\rightarrow\infty$ while $\sup_{i}n_{i}=O(1)$.

It encompasses the standard SAE assumption: for a growing number of clusters there are few observations per cluster. The requirement that $m\rightarrow\infty$ ensures consistency of $\hat{\boldsymbol{\beta}}(\boldsymbol{\delta})$ (Demidenko, 2004, Section 3.6.2). The boundedness condition on the cluster sample sizes is not crucial for the results we derive subsequently, but rather constitutes the most unfavorable case under which they hold true. In particular, if some or all $n_{i}\rightarrow\infty$, certain rates may only improve; for more details see the discussion in the Appendix.

Further, we work with the quite standard, though adapted, regularity conditions

(B1) $\mathbf{X}_{i}$, $\mathbf{Z}_{i}$, $\mathbf{G}(\boldsymbol{\delta})>0$, $\mathbf{R}_{i}(\boldsymbol{\delta})>0$, $i=1,\ldots,m$, contain only bounded values.

(B2) $\mathbf{d}_{i}^{t}=\mathbf{l}_{i}^{t}-\mathbf{b}_{i}(\boldsymbol{\delta})^{t}\mathbf{X}_{i}$ has entries $d_{ik}=O(1)$ for $k=1,\dots,p$.

(B3) $\big\{\frac{\partial}{\partial\delta_{j}}\mathbf{b}_{i}(\boldsymbol{\delta})^{t}\mathbf{X}_{i}\big\}_{k}=O(1)$, for $j=1,\dots,r$ and $k=1,\dots,p$.

(B4) $\mathbf{V}_{i}(\boldsymbol{\delta})$ is linear in the variance components $\boldsymbol{\delta}$.

The last condition (B4) implies that the second derivatives of $\mathbf{R}_{i}$ and $\mathbf{G}$ w.r.t. $\boldsymbol{\delta}$ are zero. These assumptions imply that $\mbox{E}(\hat{\mu}_{i}-\mu_{i})=0$ (Jiang, 2000).

Subsequently, dropping the cluster index $i$ refers to the respective quantity over all clusters: $\mathbf{y}=(\mathbf{y}_{1}^{t},\dots,\mathbf{y}_{m}^{t})^{t}$, $\mathbf{V}(\boldsymbol{\delta})=\text{diag}\{\mathbf{V}_{i}(\boldsymbol{\delta})\}_{i=1,\dots,m}$, $\mathbf{X}=(\mathbf{X}_{1}^{t},\dots,\mathbf{X}_{m}^{t})^{t}$, etc. Now we can construct prediction sets for $\boldsymbol{\mu}=(\mu_{1},\ldots,\mu_{m})^{t}$. The theory below extends the MSE estimator for point-wise marginal inference of Prasad and Rao (1990) to multiple inference. We start by constructing a prediction set $\mathcal{M}_{\alpha}$ such that $\mbox{P}(\boldsymbol{\mu}\in\mathcal{M}_{\alpha})\approx 1-\alpha$, where P refers to the marginal probability under (1), for a pre-specified level $\alpha\in(0,1)$. That is, $\boldsymbol{\mu}$ is considered as a random variable under the marginal law and the corresponding prediction sets are not meant for conditional inference about a fixed $\boldsymbol{\mu}$. Consider an estimator for $\boldsymbol{\Sigma}=\mbox{Cov}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})$ given by

[TABLE]

where $\mathbf{d}_{i}$ is as in (B2) and $\accentset{\approx}{\boldsymbol{\mu}}=(\accentset{\approx}{\mu}_{1},\dots,\accentset{\approx}{\mu}_{m})^{t}$ with $\accentset{\approx}{\mu}_{i}=\tilde{\mu}_{i}(\boldsymbol{\delta},\boldsymbol{\beta})$. The decomposition (4) partly follows results of Kackar and Harville (1984). The following lemma gives an estimator for $\boldsymbol{\Sigma}$ and evaluates its bias, which will be needed later on.

Lemma 1.

Let model (1) hold and $\hat{\operatorname{\boldsymbol{\delta}}}$ be either a REML estimator or given by Henderson III. Under (A1) and (B1)-(B4), consider the estimator $\widehat{\operatorname{\boldsymbol{\Sigma}}}=\widehat{\operatorname{\boldsymbol{\Sigma}}}(\hat{\operatorname{\boldsymbol{\delta}}})$ for $\operatorname{\boldsymbol{\Sigma}}$ given by

[TABLE]

where $\overline{\operatorname{\mathbf{V}}}$ is the asymptotic covariance matrix of $\hat{\operatorname{\boldsymbol{\delta}}}$ . It then holds

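$$\mbox{E}\big(\widehat{\boldsymbol{\Sigma}}\big)=\boldsymbol{\Sigma}+\big\{O\big(m^{-3/2}\big)\big\}_{m\times m}.$$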

Here, $\{O(m^{-3/2})\}_{m\times m}$ denotes an $(m\times m)$ matrix with each entry being of order $O(m^{-3/2})$. This error term comes from the uncertainty in estimating $\boldsymbol{\delta}$. The result can be concluded from the point-wise case, which corresponds to the diagonal entries of $\widehat{\boldsymbol{\Sigma}}$, and was shown by Prasad and Rao (1990) for Henderson III and by Datta and Lahiri (2000) for REML. Since only $\mathbf{K}_{2}(\hat{\boldsymbol{\delta}})$ contributes to off-diagonal entries, Lemma 1 is a straightforward, though tedious, extension, so we skip its proof. With $\widehat{\boldsymbol{\Sigma}}$ at hand we can now state:

Theorem 1.

Let model (1) hold and $\widehat{\operatorname{\boldsymbol{\Sigma}}}=\widehat{\operatorname{\boldsymbol{\Sigma}}}(\hat{\operatorname{\boldsymbol{\delta}}})$ as given in (5). Under (A1) with (B1)-(B4) it holds that

[TABLE]

where $\alpha\in(0,1)$ and $\chi^{2}_{m,1-\alpha}$ is the $(1-\alpha)$-quantile of the $\chi_{m}^{2}$-distribution.

As before, the error rate is due to the uncertainty in estimating $\operatorname{\boldsymbol{\delta}}$ . The error rates with Lemma 1 and Theorem 1 differ, since the theorem simultaneously considers all $m$ elements of $\operatorname{\boldsymbol{\mu}}\in\mathbb{R}^{m}$ , so that the error is increased from $O(m^{-3/2})$ to $O(m^{-1/2})$ . From Theorem 1 we immediately obtain the prediction set under the marginal law,

[TABLE]

with $P(\operatorname{\boldsymbol{\mu}}\in\operatorname{\mathcal{M}_{\alpha}})\approx 1-\alpha$ , for $\alpha\in(0,1)$ . In the marginal case, $\operatorname{\mathcal{M}_{\alpha}}$ is a prediction region for the random variable $\operatorname{\boldsymbol{\mu}}$ and can therefore not be readily interpreted as a confidence region for a fixed $\operatorname{\boldsymbol{\mu}}$ .
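The membership check behind such an ellipsoidal set can be sketched in a few lines. The following is only an illustration under stated assumptions: $\hat{\operatorname{\boldsymbol{\mu}}}$ and $\widehat{\operatorname{\boldsymbol{\Sigma}}}$ are taken as given inputs (their construction in (5) is not reproduced), the function names are hypothetical, and the $\chi^{2}_{m}$ quantile is replaced by the Wilson-Hilferty approximation so that only the Python standard library is needed.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(df, p):
    """Approximate p-quantile of the central chi-square law (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def quadratic_form(d, sigma):
    """Return d^t sigma^{-1} d via forward substitution on the Cholesky factor."""
    low = cholesky(sigma)
    w = []
    for i in range(len(d)):
        w.append((d[i] - sum(low[i][k] * w[k] for k in range(i))) / low[i][i])
    return sum(x * x for x in w)


def in_marginal_set(candidate, mu_hat, sigma_hat, alpha=0.05):
    """Check whether `candidate` lies in the ellipsoid centred at mu_hat."""
    d = [c - m for c, m in zip(candidate, mu_hat)]
    return quadratic_form(d, sigma_hat) <= chi2_quantile_wh(len(d), 1.0 - alpha)


# Hypothetical inputs for m = 2 clusters (not taken from the paper).
mu_hat = [0.0, 0.0]
sigma_hat = [[1.0, 0.3], [0.3, 1.0]]
print(in_marginal_set([1.0, 1.0], mu_hat, sigma_hat))
print(in_marginal_set([3.0, -3.0], mu_hat, sigma_hat))
```

For production use, an exact quantile routine from a statistics library would be preferable to the approximation used here.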

Conditional Simultaneous Confidence Sets

If the inference focus is conditional, i.e., $\mathbf{v}$ is treated as fixed, then the aim is to obtain a confidence set $\operatorname{\mathcal{C}_{\alpha}}$ with $\mbox{P}(\operatorname{\boldsymbol{\mu}}\in\operatorname{\mathcal{C}_{\alpha}}|\mathbf{v})\approx 1-\alpha$ . Notation $\text{P}(\cdot|\mathbf{v})$ means that the probability is taken under model (1), in which $\mathbf{v}=(\mathbf{v}^{t}_{1},\dots,\mathbf{v}_{m}^{t})^{t}$ is given, or ‘fixed’. Since small cluster sample sizes result in unreliable direct estimators, the confidence set $\operatorname{\mathcal{C}_{\alpha}}$ is based on the EBLUP $\hat{\mu}_{i}$ for $\mu_{i}$ from (3).

If the effect $\mathbf{v}$ is some fixed parameter, not necessarily a realization of a random variable, then model (1) is misspecified and $\operatorname{\boldsymbol{\delta}}$ is not meaningful. Since the parameters are still estimated from (1), one needs to replace $\operatorname{\boldsymbol{\delta}}$ by $\operatorname{\boldsymbol{\delta}}^{v}$, an oracle parameter that we define by $\operatorname{\boldsymbol{\delta}}^{v}=\mbox{E}(\hat{\operatorname{\boldsymbol{\delta}}}|\mathbf{v})$. To control for its variation, $\mathbf{v}$ needs to meet the following conditions, which are rather general:

(C1) $\sum_{i=1}^{m}(\operatorname{\mathbf{v}}_{i})_{e}=O(m^{1/2})$, $e=1,\dots,q$;

(C2) $\sum_{i=1}^{m}\{\operatorname{\mathbf{v}}_{i}\operatorname{\mathbf{v}}_{i}^{t}-\operatorname{\mathbf{G}}(\operatorname{\boldsymbol{\delta}}^{v})\}_{ef}=O(m^{1/2})$, $e,f=1,\dots,q$.

The first condition is required to identify $\operatorname{\boldsymbol{\beta}}$ from $(\operatorname{\boldsymbol{\beta}}^{t},\operatorname{\mathbf{v}}^{t})$. Variants thereof are commonly used in econometrics, see Hsiao (2014, Section 3.2). Condition (C2) states that the $\mathbf{v}_{i}$'s should not be too different from each other; in particular, it ensures that the stochastic part of the observed information matrix is dominated by its deterministic part. A formal discussion and details are given in Lemma 4 in the Supplement. If $\mathbf{v}$ is a realization of a normally distributed random variable, then conditions (C1) and (C2) are readily satisfied.

As in the marginal case, consider the standard regularity conditions (B1)-(B3) and asymptotic scenario (A1). The latter implies that under the conditional law $\mbox{E}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}|\mathbf{v})\nrightarrow\mathbf{0}_{m}$, due to the boundedness of $n_{i}$, rendering conditional inference for a single $\mu_{i}$ infeasible. If $m\rightarrow\infty$ and $n_{i}\rightarrow\infty$ for some fixed $i$, such inference would be possible for the corresponding $\mu_{i}$, as $\mbox{E}(\hat{\mu}_{i}-\mu_{i}|\mathbf{v})\rightarrow 0$. Only if $m\rightarrow\infty$ and $n_{i}\rightarrow\infty$ for all $i=1,\dots,m$ does the conditional bias vanish for all clusters, so that $\mbox{E}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}|\mathbf{v})\rightarrow\mathbf{0}_{m}$. Since our results allow multiple conditional inference for non-vanishing bias under (A1), they still hold if all or some $n_{i}\rightarrow\infty$. These technical differences are discussed in more detail in the Appendix.

Proceeding as for the marginal case, $\operatorname{\boldsymbol{\Sigma}}_{v}=\mbox{Cov}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}|\mathbf{v})$ can be decomposed as

[TABLE]

where for $\mathbf{K}_{k}=\mathbf{R}_{k}\operatorname{\mathbf{V}}_{k}^{-1}\operatorname{\mathbf{X}}_{k}\left(\sum_{l=1}^{m}\operatorname{\mathbf{X}}_{l}^{t}\operatorname{\mathbf{V}}_{l}^{-1}\operatorname{\mathbf{X}}_{l}\right)^{-1}$ , and with notation from (2)

[TABLE]

This decomposition is similar to (4), but the cross-terms do not vanish. As explained above, $\operatorname{\boldsymbol{\delta}}^{v}$ now replaces $\operatorname{\boldsymbol{\delta}}$, although $\hat{\operatorname{\boldsymbol{\delta}}}$ remains the same. The next lemma gives an estimator $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}=\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}(\hat{\operatorname{\boldsymbol{\delta}}})$ for the conditional covariance matrix $\operatorname{\boldsymbol{\Sigma}}_{v}$ and evaluates its bias.

Lemma 2.

Let model (1) hold. Under (A1), with (B1)-(B4), (C1), (C2),

[TABLE]

where $\widehat{\mathbf{L}}_{3}(\operatorname{\boldsymbol{\delta}}^{v})$ is given in (14) if $\hat{\operatorname{\boldsymbol{\delta}}}$ is obtained via REML, or in (15) if $\hat{\operatorname{\boldsymbol{\delta}}}$ is obtained via Henderson III. Further,

[TABLE]

where $\overline{\operatorname{\mathbf{V}}}$ is the asymptotic covariance matrix of $\hat{\operatorname{\boldsymbol{\delta}}}$ . Then it holds

[TABLE]

The proof is given in the Supplement. As the EBLUP is not unbiased under conditional law, $(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})^{t}\operatorname{\boldsymbol{\Sigma}}_{v}^{-1}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})|\mathbf{v}\sim\chi^{2}_{m}(\lambda)$ for $\lambda=\|\operatorname{\boldsymbol{\Sigma}}_{v}^{-1/2}\mbox{E}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}|\mathbf{v})\|^{2}$ . The non-centrality parameter $\lambda$ depends on the conditional bias, and cannot be estimated directly for any cluster individually, but only jointly. Specifically, let ${\mathbf{A}}(\operatorname{\boldsymbol{\delta}}^{v})=({\mathbf{a}}_{1}^{t},\dots,{\mathbf{a}}_{m}^{t})^{t}\in\mathbb{R}^{m\times n}$ , with

[TABLE]

where $\mathbf{J}_{i}=(0,\dots,0,\mathbf{I}_{q},0,\dots,0)\in\mathbb{R}^{q\times qm}$, so that ${\mathbf{a}}_{i}\mathbf{Z}\mathbf{v}=\mbox{E}(\tilde{\mu}_{i}-\mu_{i}|\mathbf{v})$. We propose to estimate the non-centrality parameter by

[TABLE]

Note that $\text{E}\{\tilde{\lambda}\big{(}\operatorname{\boldsymbol{\Sigma}}_{v},\tilde{\operatorname{\boldsymbol{\beta}}},\operatorname{\boldsymbol{\delta}}^{v}\big{)}|\mathbf{v}\}/n=\lambda+O(m^{-1/2})$ . With this estimator we can show

Theorem 2.

Let model (1) hold and $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}=\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}(\hat{\operatorname{\boldsymbol{\delta}}})$ as in (7) and $\hat{\lambda}$ as in (9). Under (A1), with (B1)-(B4), (C1) and (C2) it holds that

[TABLE]

where $\alpha\in(0,1)$, and $\chi^{2}_{m,1-\alpha}(\hat{\lambda})$ is the $(1-\alpha)$-quantile of the non-central $\chi_{m}^{2}$-distribution with non-centrality parameter $\hat{\lambda}$.

Like in Theorem 1, the error of rate $m^{-1/2}$ is due to the uncertainty in estimating $\operatorname{\boldsymbol{\delta}}^{v}$ which enters $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}$ and $\hat{\lambda}$ . The result gives the conditional confidence set

[TABLE]

with $\mbox{P}(\operatorname{\boldsymbol{\mu}}\in\operatorname{\mathcal{C}_{\alpha}}|\mathbf{v})\approx 1-\alpha$ , $\alpha\in(0,1)$ . The practical difficulty when constructing $\operatorname{\mathcal{C}_{\alpha}}$ is the unhandy representation of $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}$ and $\hat{\lambda}$ . Yet, the following result states that the much simpler $\operatorname{\mathcal{M}_{\alpha}}$ , albeit derived for the marginal case, leads to the asymptotically correct coverage in the conditional case as well.
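The role of the non-centrality parameter in $\operatorname{\mathcal{C}_{\alpha}}$ can be illustrated with a small Monte Carlo sketch. It assumes a known, diagonal conditional covariance and a known conditional bias, both hypothetical values chosen only for illustration, and checks that the quadratic form then behaves like a non-central $\chi^{2}_{m}(\lambda)$ variable with $\lambda=\|\operatorname{\boldsymbol{\Sigma}}_{v}^{-1/2}\mbox{E}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}|\mathbf{v})\|^{2}$, whose mean is $m+\lambda$. It does not implement the estimator $\hat{\lambda}$ from (9).

```python
import random


def simulate_quadratic_form(bias, variances, n_rep=20000, seed=1):
    """Simulate d^t Sigma_v^{-1} d for d ~ N(bias, diag(variances)).

    The diagonal covariance is a simplifying assumption of this sketch.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_rep):
        q = 0.0
        for b, s2 in zip(bias, variances):
            d = rng.gauss(b, s2 ** 0.5)
            q += d * d / s2
        draws.append(q)
    return draws


def noncentrality(bias, variances):
    """lambda = || Sigma_v^{-1/2} bias ||^2 for a diagonal Sigma_v."""
    return sum(b * b / s2 for b, s2 in zip(bias, variances))


def empirical_quantile(draws, p):
    """Simple order-statistic estimate of the p-quantile."""
    ordered = sorted(draws)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]


# Hypothetical conditional bias and variances for m = 3 clusters.
bias = [0.5, -1.0, 0.2]
variances = [1.0, 0.5, 2.0]
biased = simulate_quadratic_form(bias, variances)
central = simulate_quadratic_form([0.0, 0.0, 0.0], variances)
print("lambda:", noncentrality(bias, variances))
print("mean of quadratic form:", sum(biased) / len(biased))
print("95% quantile with bias:", empirical_quantile(biased, 0.95))
print("95% quantile without bias:", empirical_quantile(central, 0.95))
```

The larger quantile in the biased case is what enlarges the conditional set relative to a central $\chi^{2}_{m}$ cut-off.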

Theorem 3.

Let model (1) hold and $\widehat{\operatorname{\boldsymbol{\Sigma}}}=\widehat{\operatorname{\boldsymbol{\Sigma}}}(\hat{\operatorname{\boldsymbol{\delta}}})$ as in (5). Under (A1) with (B1)-(B4), (C1) and (C2) it holds that

[TABLE]

The theorem states that the misspecification in using the marginal covariance matrix under the conditional law is averaged out across clusters. Remarkably, the rate at which this misspecification vanishes is of the same magnitude as the estimation error in estimating $\operatorname{\boldsymbol{\delta}}^{v}$ . Without any extra cost, at least if looking at the first order, the much simpler marginal confidence set can be applied in the conditional scenario, i.e., $\mbox{P}(\operatorname{\boldsymbol{\mu}}\in\operatorname{\mathcal{M}_{\alpha}}|\mathbf{v})\approx 1-\alpha$ .

In contrast to the previous two theorems, the error term in Theorem 3 is composed of both the estimation error, which depends on all observations, and the misspecification error, which depends on the number of comparisons. This highlights why individual conditional inference based on the marginal MSE is not possible under (A1): If the quadratic form in Theorem 3 is reformulated for one cluster, it follows from the proof that

[TABLE]

which is not useful for $\mbox{sup}_{i}n_{i}=O(1)$ . Only if $n_{i}\rightarrow\infty$ , $\hat{\mu}_{i}$ becomes consistent for $\mu_{i}$ under the conditional law, and nominal coverage for a single $\mu_{i}$ is asymptotically attained.

Theorems 2 and 3 deal with the problem of how conditional inference for mixed parameters can be performed. The latter theorem suggests that multiple inference about $\operatorname{\boldsymbol{\mu}}$ under the conditional law can be performed based on the confidence sets obtained under the marginal law. Figure 1 shows that this effect occurs even though the sets are not necessarily equal. For $m=2$, the two confidence sets are drawn for randomly generated random effects as described in Section 3. Although both are centered around $\hat{\operatorname{\boldsymbol{\mu}}}$ and have the same coverage probability under the conditional law, they differ in shape. Besides the obvious dependence of both $\operatorname{\mathcal{C}_{\alpha}}$ and $\operatorname{\mathcal{M}_{\alpha}}$ on the random effect $\mathbf{v}$ through $\hat{\operatorname{\boldsymbol{\delta}}}$, the former also depends on $\mathbf{v}$ via $\lambda$. The non-centrality parameter extends $\operatorname{\mathcal{C}_{\alpha}}$ to such a degree that the confidence region meets the nominal coverage probability. The set $\operatorname{\mathcal{M}_{\alpha}}$, on the other hand, ignores the bias of $\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}$ that occurs under the conditional law. This is compensated by the fact that it is inflated by the marginal variance of $\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}$, which, in contrast to the conditional variance, incorporates the variability of the random effects. Theorem 3 states that both properties cancel each other out in such a manner that the nominal level is approximately attained. The obvious suspicion that this occurs at the cost of a larger volume for the marginal set is dispelled by our simulations in Section 3. Clearly, $\operatorname{\mathcal{M}_{\alpha}}$ does not require any separately estimated parameters, which simplifies its implementation.

Conditional Multiple Testing

It is appealing to use the derived results for multivariate hypothesis testing under the conditional law. This can be used to check whether $\operatorname{\boldsymbol{\mu}}$ lies in a given subspace of $\mathbb{R}^{m}$, and includes tests of all kinds of linear comparisons between clusters. It can also be applied to examine whether cluster-specific effects are present within subsets, cf. Section 4. Consider

[TABLE]

where $\mathbf{a}\in\mathbb{R}^{m}$ and $\mathbf{L}$ is a given $(u\times m)$-matrix with $u\leq m$ and $\text{rank}(\mathbf{L})=u$, $u=m^{\xi_{1}}$, and $\xi_{1}\in(0,1]$ bounded away from zero. The dimension $u$ of the linear subspace of $\mathbb{R}^{m}$ corresponds to the number of multiple tests of linear combinations, whereas each linear combination of interest is specified in a row of $\mathbf{L}$. For example, the choice $\mathbf{L}=\mathbf{I}_{m}$ and $\mathbf{a}=(a_{1},\dots,a_{m})^{t}$, $a_{i}\neq a_{j}$, $i,j\leq m$, tests whether the mixed parameters take on some ex-ante assumed value(s). For conditional inference about $\operatorname{\boldsymbol{\mu}}$, Theorem 2 gives the $\alpha$-level test for (11) that rejects $H_{0}$ if $\mathbf{a}\not\in\mathcal{C}_{\alpha,\mathbf{L}}$, where

[TABLE]

with $\hat{\lambda}_{\mathbf{L}}$ being the non-centrality parameter estimate that depends on the covariance $\mathbf{L}\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}\mathbf{L}^{t}$ . This test is consistent with an error of size $O(u^{-1/2})$ . Theorem 3 allows us to employ the confidence set $\operatorname{\mathcal{M}_{\alpha}}$ as well. An $\alpha$ -level test rejects $H_{0}$ if $\mathbf{a}\not\in\mathcal{M}_{\alpha,\mathbf{L}}$ , where

[TABLE]

This test is again consistent at rate $u^{-1/2}$. It affirms that individual confidence intervals ($u=1$) cannot be constructed using either $\mathcal{M}_{\alpha,\mathbf{L}}$ or $\mathcal{C}_{\alpha,\mathbf{L}}$ under (A1), the standard SAE assumption. Note finally that the derived estimators $\widehat{\operatorname{\boldsymbol{\Sigma}}}$ and $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}$ also lend themselves to related testing procedures, such as Tukey's tests, see our Supplement.

3 Simulation Examples and Performance Study

While it should be emphasized that the above methods, theory and developments hold for the general LMM (1), and thereby also for complex models with slopes that potentially vary over clusters, we concentrate in the following on a very popular, though simpler, version, namely the nested error regression model of Battese et al. (1988) with $e_{ij}\sim\mathcal{N}(0,\sigma_{e}^{2})$, $v_{i}\sim\mathcal{N}(0,\sigma_{v}^{2})$, and

[TABLE]

The data are simulated as follows. For each given set of the parameters $m$ , $n_{i}$ , $\sigma_{e}^{2}$ , $\sigma^{2}_{v}$ , the value of the cluster effect $v_{i}$ is obtained as a realization of a $\mathcal{N}(0,\sigma^{2}_{v})$ distributed random variable and remains fixed in all Monte Carlo samples. The covariates $x_{ij}$ are drawn once from a standard normal distribution, whereas the coefficient parameters are set to $(\beta_{0},\beta_{1})=(-4.9,0.03)$ , which is similar to the study in Section 4. The parameter of interest is the conditional mean $\mu_{i}=\beta_{0}+\sum_{j=1}^{n_{i}}x_{ij}\beta_{1}/n_{i}+v_{i}$ . Since the random effects are being drawn from a Gaussian distribution, the requirements of Theorem 3 are fulfilled.
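The simulation design described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the nested error regression form $y_{ij}=\beta_{0}+x_{ij}\beta_{1}+v_{i}+e_{ij}$ (the form stated for the case study in Section 4), the function names are hypothetical, and the seeds are arbitrary. The key point is that $v_{i}$ and $x_{ij}$ are drawn once, while only the errors $e_{ij}$ are redrawn in each Monte Carlo sample.

```python
import random


def make_design(m, n_i, sigma2_v, seed=0):
    """Draw cluster effects and covariates once; they stay fixed afterwards."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, sigma2_v ** 0.5) for _ in range(m)]
    x = [[rng.gauss(0.0, 1.0) for _ in range(n_i)] for _ in range(m)]
    return v, x


def conditional_means(v, x, beta0=-4.9, beta1=0.03):
    """mu_i = beta0 + beta1 * mean_j(x_ij) + v_i, the target of inference."""
    return [beta0 + beta1 * sum(xi) / len(xi) + vi for vi, xi in zip(v, x)]


def draw_sample(v, x, sigma2_e, rng, beta0=-4.9, beta1=0.03):
    """One Monte Carlo sample: only the errors e_ij are redrawn."""
    sd_e = sigma2_e ** 0.5
    return [[beta0 + beta1 * xij + vi + rng.gauss(0.0, sd_e) for xij in xi]
            for vi, xi in zip(v, x)]


v, x = make_design(m=100, n_i=5, sigma2_v=4.0)
mu = conditional_means(v, x)          # fixed across all Monte Carlo samples
rng = random.Random(123)
samples = [draw_sample(v, x, sigma2_e=4.0, rng=rng) for _ in range(3)]
print(len(samples), len(samples[0]), len(samples[0][0]))
```

Because $\mathbf{v}$ is held fixed, coverage frequencies computed over such samples estimate probabilities under the conditional law.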

Before we study joint inference, let us briefly look at cluster-wise inference, that is, inference about a specific $\mu_{i}$. As (10) indicates that this cannot be done consistently under the conditional law if $n_{i}\nrightarrow\infty$, one considers statistics of type $T_{i}:=(\tilde{\mu}_{i}-\mu_{i})\mbox{Var}(\tilde{\mu}_{i}-\mu_{i})^{-1/2}$, $i=1,...,m$, under the marginal law. Yet, plotting $\mbox{P}(|T_{i}|\leq z_{1-\alpha/2}|\mathbf{v})$ against $v_{i}$, with $z_{1-\alpha/2}$ being the $(1-\alpha/2)$-quantile of $\mathcal{N}(0,1)$, Figure 2 shows how much the coverage probabilities of standard confidence intervals vary with the cluster effects $v_{i}$. These results are based on 1,000 Monte Carlo samples with $m=100$, $(\sigma_{v}^{2},\sigma_{e}^{2})=(4,4)$. We see that clusters with a large $|v_{i}|$, i.e., with the most prominent cluster effect, exhibit severe undercoverage. This is particularly unfortunate, since such clusters are arguably those that a practitioner might be most interested in (Jiang and Lahiri, 2006). For large $n_{i}$, this problem is less pronounced, since the bias for every cluster vanishes asymptotically, and so does the difference between conditional and marginal variance.

On average, i.e., over all clusters, over- and undercoverage cancel each other out, as the following result shows.

Proposition 1.

Let model (1) hold, $\operatorname{\boldsymbol{\delta}}^{v}$ be known, $T_{i}=(\tilde{\mu}_{i}-\mu_{i})\mbox{Var}(\tilde{\mu}_{i}-\mu_{i})^{-1/2}$ and $z_{1-\alpha/2}$ be the $(1-\alpha/2)$-quantile of $\mathcal{N}(0,1)$. Then, under (A1) with (B1), (B2), (C1) and (C2),

[TABLE]

Although nominal coverage is almost surely not attained for single confidence intervals, the coverage probability of marginal confidence intervals under the conditional law still attains its nominal level on average over all clusters, compare also with Zhang (2007). For the simulated data in Figure 2 the average coverage is 95.4% (left) and 94.9% (right). The finding in Proposition 1 has been previously described by Wahba (1983) and Nychka (1988) in the context of nonparametric regression. For an extension of Proposition 1, more simulation results, and the construction of Tukey's intervals, see our Supplement.
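The averaging effect can be reproduced in closed form for a deliberately simplified case. The sketch below assumes a random-intercept model with known $\beta$ and known variance components, so that $\tilde{\mu}_{i}-\mu_{i}\,|\,v_{i}$ is normal with mean $-(1-\gamma_{i})v_{i}$ and standard deviation $\gamma_{i}\sigma_{e}/\sqrt{n_{i}}$, where $\gamma_{i}=\sigma_{v}^{2}/(\sigma_{v}^{2}+\sigma_{e}^{2}/n_{i})$. These simplifications and the function names are assumptions of this illustration; it is not the simulation behind Figure 2.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_coverage(v_i, n_i, sigma2_v, sigma2_e, alpha=0.05):
    """P(|T_i| <= z_{1-alpha/2} | v_i) for the BLUP with known beta and delta."""
    gamma = sigma2_v / (sigma2_v + sigma2_e / n_i)
    marginal_sd = (gamma * sigma2_e / n_i) ** 0.5   # sd of mu_tilde - mu, marginal law
    cond_sd = gamma * (sigma2_e / n_i) ** 0.5       # sd of mu_tilde - mu given v_i
    cond_bias = -(1.0 - gamma) * v_i                # E(mu_tilde - mu | v_i)
    half_width = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) * marginal_sd
    upper = STD_NORMAL.cdf((half_width - cond_bias) / cond_sd)
    lower = STD_NORMAL.cdf((-half_width - cond_bias) / cond_sd)
    return upper - lower


def average_coverage(n_i, sigma2_v, sigma2_e, alpha=0.05, grid=4000):
    """Average the conditional coverage over v_i ~ N(0, sigma2_v) on a quantile grid."""
    sd_v = sigma2_v ** 0.5
    total = 0.0
    for k in range(grid):
        v_i = sd_v * STD_NORMAL.inv_cdf((k + 0.5) / grid)
        total += conditional_coverage(v_i, n_i, sigma2_v, sigma2_e, alpha)
    return total / grid


print(conditional_coverage(0.0, n_i=5, sigma2_v=4.0, sigma2_e=4.0))   # small |v_i|
print(conditional_coverage(6.0, n_i=5, sigma2_v=4.0, sigma2_e=4.0))   # large |v_i|
print(average_coverage(n_i=5, sigma2_v=4.0, sigma2_e=4.0))
```

In this simplified setting, clusters with $v_{i}$ near zero are overcovered, clusters with large $|v_{i}|$ are undercovered, and the average over the distribution of $v_{i}$ returns to the nominal level.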

For the multiple inference we consider the same design as above. The cluster sample sizes are $n_{i}=5$ (cf. the study of Battese et al. (1988)), $n_{i}=10$ and $n_{i}=50$. A study for an unbalanced data set, in which some clusters even have $n_{i}=1$, is provided in the Supplement. We investigate different ratios of $\sigma_{v}^{2}$ and $\sigma_{e}^{2}$ with values that again are motivated by the case study estimates in Section 4. Studying different ratios, instead of only the one that appears in our case study, is of interest because for model (12) the BLUP can be expressed as an average of a direct estimator with weight $\gamma_{i}$ and an estimator for the effect shared by all clusters (often called the national estimator) with weight $1-\gamma_{i}$. Often, $\gamma_{i}$ is referred to as the intraclass correlation coefficient (ICC)

[TABLE]

and plays a key role in the reliability of $\widehat{\mu}_{i}$ . Estimates $\hat{\operatorname{\boldsymbol{\mu}}}$ and $\widehat{\operatorname{\boldsymbol{\Sigma}}}$ , as well as $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}$ and $\hat{\lambda}$ are calculated, and it is checked whether $\operatorname{\boldsymbol{\mu}}$ lies within the $95\%-$ confidence sets $\operatorname{\mathcal{M}_{\alpha}}$ and $\operatorname{\mathcal{C}_{\alpha}}$ . The structure of $\widehat{\operatorname{\boldsymbol{\Sigma}}}$ and $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}$ allows matrix inversion by Woodbury’s formula, leading to fast calculations. Table 1 contains results based on $10,000$ Monte Carlo samples. Coverage probabilities are reported together with those of the oracle confidence sets for known $\operatorname{\boldsymbol{\delta}}^{v}$ and $\lambda$ . Since the simulation is carried out in the conditional setting, $\operatorname{\boldsymbol{\delta}}^{v}$ is the adequate oracle for the marginal set as well. By construction, $\hat{\operatorname{\boldsymbol{\delta}}}$ is a consistent estimator thereof. The relative average volume of the confidence sets to the volume of the marginal set is given in brackets. Recall that the asymptotic behavior relies on $m$ .
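To make the role of $\gamma_{i}$ concrete, the following sketch evaluates the weight for the cluster sizes used in the simulations. It assumes the standard shrinkage weight of the nested error regression model, $\gamma_{i}=\sigma_{v}^{2}/(\sigma_{v}^{2}+\sigma_{e}^{2}/n_{i})$, and uses the REML estimates reported in Section 4 as plug-in values; the function names are hypothetical.

```python
def shrinkage_weight(n_i, sigma2_v, sigma2_e):
    """gamma_i = sigma_v^2 / (sigma_v^2 + sigma_e^2 / n_i) (assumed standard form)."""
    return sigma2_v / (sigma2_v + sigma2_e / n_i)


def blup_nested_error(direct, shared, n_i, sigma2_v, sigma2_e):
    """Weighted average of a direct estimate and an estimate shared by all clusters."""
    g = shrinkage_weight(n_i, sigma2_v, sigma2_e)
    return g * direct + (1.0 - g) * shared


# Plug-in values taken from the case study: sigma_v^2 ~ 0.43, sigma_e^2 ~ 0.86.
for n_i in (5, 10, 50):
    print(n_i, round(shrinkage_weight(n_i, sigma2_v=0.43, sigma2_e=0.86), 3))
```

The weight grows with $n_{i}$ and with $\sigma_{v}^{2}/\sigma_{e}^{2}$, so the predictor leans more on the direct estimator exactly when that estimator is more reliable.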

Table 1 gives the empirical coverage of the corresponding confidence sets for the nominal level $1-\alpha=0.95$. The two rightmost columns compare the confidence sets based on $\operatorname{\mathcal{M}_{\alpha}}$ and $\operatorname{\mathcal{C}_{\alpha}}$ constructed as outlined above when $\hat{\operatorname{\boldsymbol{\delta}}}$ is obtained by REML. The other three columns, in the center of the same table, are given for comparison. The exact confidence set is the 'Conditional: known $\lambda$, $\operatorname{\boldsymbol{\delta}}^{v}$'. Here, the nominal level is readily attained. The coverage of the confidence set 'Marginal: known $\operatorname{\boldsymbol{\delta}}^{v}$' exhibits the error solely due to the misspecification in using the marginal set for the conditional setting, as described in Theorem 3. Despite its error rate $O(m^{-1/2})$, the empirical coverage is so close to $1-\alpha$ that the error cannot be seen in Table 1. Comparing the two 'Conditional' columns with oracle parameters reveals the impact of estimating $\lambda$; see the Supplement for a deeper analysis of the reliability of the estimation of $\lambda$.

Clearly, the coverage probabilities improve for larger $m$ and/or $n_{i}$. This is in line with the theoretical findings. However, the coverage error is overlaid by the effect of the ICC. If it is close to $1$, the REML estimates are stable, and so is $\hat{\lambda}$. The ICC is influenced by two drivers. Firstly, it depends on the relative size of $\sigma_{v}^{2}$ to $\sigma_{e}^{2}$. If $\sigma_{v}^{2}$ is large, the empirical coverage is closest to the nominal level. This has already been observed for individual confidence intervals (Das et al., 2004). Secondly, the ICC depends on the size of $n_{i}$. Irrespective of the reliability of the REML estimates, a large $n_{i}$ results in accurate coverage probabilities, as can be seen in the last row for each configuration of $(\sigma_{v}^{2},\sigma_{e}^{2})$. Conversely, even for known $\operatorname{\boldsymbol{\delta}}^{v}$, a small $n_{i}$ may cause severe undercoverage. All these effects shape the performance of the REML-based confidence sets: it is evident that the asymptotic behavior cannot be observed when $n_{i}$ and $\sigma_{v}^{2}$ are small compared to the noise level $\sigma_{e}^{2}$. For more discussion see the Supplement.

Finally, we consider the test $H_{0}:\operatorname{\boldsymbol{\mu}}=\mathbf{a}$ vs. $H_{1}:\operatorname{\boldsymbol{\mu}}=\mathbf{a}+\mathbf{1}_{m}\Delta$, $\mathbf{a}\in\mathbb{R}^{m}$ with $\Delta\in\mathbb{R}$. Power functions studying the error of the second kind for different parameters $m$ and $n_{i}$ are given in Figure 3 for different ICC. Unsurprisingly, the power grows more steeply for larger $m$ and $n_{i}$, but again is sensitive to the relative size of $\sigma_{v}^{2}$ to $\sigma_{e}^{2}$. The power curve of the tests based on marginal sets (solid line) is notably steeper than that of the tests based on conditional sets (dashed).

All in all, both the conditional and marginal sets exhibit similar coverage probabilities, which is in line with the theoretical findings. However, due to its simpler construction and broader application, the results of both Table 1 and Figure 3 favor the use of marginal confidence sets, especially for testing.

4 Study on Covid-19 Mortality in US State Prisons

The methods introduced above are applied to Covid-19 related mortality rates in US state prisons from March 2020 until the end of March 2021, published by the New York Times. The data are from $n=494$ US state prisons in $m=45$ states, which form the clusters. The model is $y_{ij}=\beta_{0}+x_{ij}\beta_{1}+v_{i}+e_{ij}$, where $y_{ij}$ is the log-mortality for prison $j$ in state $i$ and $x_{ij}$ is the standardized log-mortality of the county in which the prison is located. The covariates account for local effects on mortality, while the error terms account for the plethora of unobserved variables. The random effect $v_{i}$ describes the remaining state effect on mortality. The number of prisons $n_{i}$ in each state ranges from $1$ to $46$, with a median of $8$. The use of direct estimators is unreliable due to the small number of observations per state, so that the LMM modeling device is appealing. However, following the logic of our above discussions, the inference will be performed conditionally on $v_{i}$. The parameter of interest $\mu_{i}=\beta_{0}+\sum_{j=1}^{n_{i}}x_{ij}\beta_{1}/n_{i}+v_{i}$, the mean log-mortality in prisons per state (subsequently "mortality"), is estimated via the EBLUP $\widehat{\mu}_{i}$. The fixed effects are estimated as $(\widehat{\beta}_{0},\widehat{\beta}_{1})\approx(-4.79,0.03)$ and the variance components via REML as $\widehat{\sigma}_{v}^{2}\approx 0.43$ and $\widehat{\sigma}_{e}^{2}\approx 0.86$. In sum, the setting is similar to those analyzed in our simulation studies. Furthermore, the specification, including the normality assumption for model (1), is graphically assessed by residual analyses in the Supplement. The residuals show neither anomalies nor a violation of our model assumptions. The estimates $\widehat{\mu}_{i}$ are visualized in Figure 4.

First, we state the hypothesis that the state effect is due to Covid-19 related policies. An interesting question could be whether mortality in Democratic-governed states is lower than in Republican-governed ones. Formally, let $\mu_{R}$ be the mortality for all $22$ states governed by Republicans and $\mu_{D}$ for all $23$ Democratic ones for which data are available. The corresponding t-test of $H_{0}$: $\mu_{R}\leq\mu_{D}$ vs. $H_{1}$: $\mu_{R}>\mu_{D}$, using direct estimates for the two types of states, rejects the null at common significance levels with a p-value of $P_{H_{0}}(T>t)\approx 10^{-8}$. However, the above t-test supposes that all observations for the same party come from a distribution with the same mean, i.e., that there are no systematic differences in mortality within Democratic or Republican states, respectively. This can be checked using a linear hypothesis test as described in Section 2. Formally, interest lies in verifying the hypothesis that groups of states share the same state effect. For a group of $u+1$ states, let $\mathbf{L}=(\boldsymbol{0}_{u},\dots,\mathbf{L}^{\ast},\boldsymbol{0}_{u},\dots)$ be $(u\times 45)$, with $\mathbf{L}^{\ast}=\big{(}\mathbf{I}_{u},\boldsymbol{0}_{u}\big{)}-\mathbf{1}_{u}\mathbf{1}_{u+1}^{t}/(u+1)$ corresponding to the states of interest. We test the hypothesis $H_{0}:\mathbf{L}\operatorname{\boldsymbol{\mu}}=\boldsymbol{0}_{u}$ against $H_{1}:\mathbf{L}\operatorname{\boldsymbol{\mu}}\neq\boldsymbol{0}_{u}$. This tests whether all states in the considered group share an equal state mean. If the group consists of the first $u+1$ states, the null hypothesis is equivalent to $H_{0}:\mu_{1}=\mu_{2}=\dots=\mu_{u+1}$. The results of two such tests for equality, for Democratic- and Republican-governed states respectively, are given in Table 2. The table reports the rank $u$ of $\mathbf{L}$, which determines the rate at which the tests are consistent, the value of the quadratic form as 'Pivot' together with the corresponding quantile and p-value and, for the conditional case, the estimated non-centrality parameter $\hat{\lambda}_{\mathbf{L}}$.
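The construction of the contrast matrix can be sketched as follows. The helper names and the toy state indices are hypothetical; the block itself follows the definition $\mathbf{L}^{\ast}=(\mathbf{I}_{u},\boldsymbol{0}_{u})-\mathbf{1}_{u}\mathbf{1}_{u+1}^{t}/(u+1)$, whose $i$-th row maps $\operatorname{\boldsymbol{\mu}}$ to the difference between the $i$-th group member and the group average.

```python
def contrast_block(u):
    """L* = (I_u, 0_u) - 1_u 1_{u+1}^t / (u + 1), a u x (u+1) matrix."""
    return [[(1.0 if j == i else 0.0) - 1.0 / (u + 1) for j in range(u + 1)]
            for i in range(u)]


def embed(block, positions, m):
    """Place the columns of `block` at the given cluster indices of a u x m matrix."""
    full = [[0.0] * m for _ in block]
    for r, row in enumerate(block):
        for c, pos in enumerate(positions):
            full[r][pos] = row[c]
    return full


def apply(matrix, mu):
    """Matrix-vector product L mu."""
    return [sum(a * b for a, b in zip(row, mu)) for row in matrix]


# Toy example: m = 6 clusters, group of u + 1 = 3 clusters at indices 0, 3, 4.
L = embed(contrast_block(2), positions=[0, 3, 4], m=6)
print(apply(L, [1.0, 9.0, 9.0, 1.0, 1.0, 9.0]))   # group means equal
print(apply(L, [1.0, 9.0, 9.0, 2.0, 1.0, 9.0]))   # group means differ
```

Since each row subtracts the group average, $\mathbf{L}\operatorname{\boldsymbol{\mu}}=\boldsymbol{0}_{u}$ holds exactly when all $u+1$ group members share the same mean, while clusters outside the group do not enter.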

At common significance levels, both tests reject the hypothesis that the mortality is equal in all Democratic or Republican states, respectively, which conflicts with the assumptions of the above t-test. It is, however, noteworthy that the $p$-values of the marginal and conditional sets are noticeably different. This is due to the estimate obtained for the non-centrality parameter, on which the conditional confidence set relies. The estimator for the non-centrality parameter $\lambda$ works well in balanced panels, as can be seen in Table 1. In the present data set, however, the panel is unbalanced, with a large proportion of states having few observations. Moreover, a large error variance $\sigma_{e}^{2}$ relative to $\sigma_{v}^{2}$ in such unbalanced panels with small sample sizes is known to lead to very unreliable estimators for $\sigma_{e}^{2}$ and $\sigma_{v}^{2}$, and hence for $\lambda_{\mathbf{L}}$, especially if $m$ is not too large. This is confirmed by parametric bootstrap estimates for $\lambda_{\mathbf{L}}$ in all tests of Tables 2 and 3, provided in the Supplement. The high variability of the estimator $\hat{\lambda}_{\mathbf{L}}$ causes the conditional confidence set to be unreasonably large, resulting in overcoverage. This claim is verified in the last line of Table 6 in the Supplement, which replicates exactly the setting corresponding to our real data example. At the same time, the results shown in the last line of Table 6 confirm that the marginal sets perform excellently in the given parameter constellation. Moreover, if one is interested in groups other than those previously considered, the conditional approach requires re-estimating the non-centrality parameter on each new subset of interest. These aspects enhance the value of Theorem 3 and give strong support for the application of the marginal set in practice.

Instead of looking at political party effects, one may look at geographic effects and check whether mortality is equal within certain groups of states. We repeat the above test for groups formed by the four regions of the US Census Bureau. The results are given in Table 3. For common significance levels, the tests reject the null hypothesis for the census regions Midwest, South and West. For the census region Northeast, the null hypothesis cannot be rejected. Potentially, this is because the state policies are homogeneous within this census region. Again, the influence of the non-centrality parameter can be observed for Northeast, even though it may not make a difference for the conclusion, as the marginal and conditional tests give the same results for significance levels $\alpha=0.01$, $0.05$, and $0.1$. For the southern census region, $10$ of $15$ of all individual tests do not reject $H_{0}$, and neither would a joint test with Bonferroni correction. In fact, the latter is true for all census regions except Northeast. This illustrates that our multiple tests represent an important complement to the existing single ones, whether or not they are combined with Bonferroni.

Certainly, the above illustration gives just some particular examples, but it is obvious that any other linear hypotheses with $u\leq m$ could be tested analogously. We believe that such tests are highly relevant, insightful and helpful in practice. One can also use the confidence sets to see how large a change would be needed in which clusters in order to eliminate significant differences, which makes our tools useful for policy makers.

5 Discussion

Under assumption (A1), inference based on predictors for single clusters is intractable under the conditional law due to the bias. This is the reason why single cluster inference has only been performed under the marginal framework. As shown in Proposition 1, the inference for the individual mixed parameter holds on average only. In this work we derived joint confidence sets for mixed parameters $\mu_{1},\ldots,\mu_{m}$ in LMMs under both the marginal and the conditional law. The latter requires the estimation of a non-centrality parameter of the respective $\chi^{2}(\lambda)$-distribution. We have shown that with its estimate, the desired nominal coverage is attained at the usual parametric rate. To the best of our knowledge, our method allows for inference on multiple clusters under the conditional law for the first time. In particular, it lends itself to inference on a subset of clusters of interest, as illustrated in the study on Covid-19 mortality in US state prisons. Further, we show that, surprisingly and in contrast to cluster-wise confidence intervals, the joint (or multiple) confidence sets built under the marginal law are approximately valid at the same parametric rate when conditioning on the clusters. A simulation study confirms this effect already for samples of small and moderate size. Our results hold for all kinds of linear combinations of mixed parameters $\mu_{i}$ of a cluster $i$.

The order of the derived error relies on the normality assumption in (1). If no distributional assumption is justified, additional regularity conditions governing the boundedness of higher moments have to be imposed, and resampling methods could be applied. Moreover, simulations carried out for non-Gaussian random effects, shown in the Supplement, indicate the robustness of the proposed confidence sets. Furthermore, when it is of interest to test linear contrasts of mixed parameters, we extend our test to cover multiple comparisons by Tukey’s method, see the Supplement. However, the application of this method is limited to special cases where the corresponding bias can be shown to be negligible, and the considered subset of pairwise differences falls exactly into the class of Tukey’s testing problem. Finally, we expect that, in general, our methods and results can be extended to other predictors of LMMs, such as the empirical best predictor of Jiang et al. (2011).

Acknowledgments

The authors thank Domingo Morales, Carmen Cadarso-Suárez, Jiming Jiang, María-José Lombardía and Wenceslao González-Manteiga for helpful discussion. This work has been carried out while the first two authors were employed at the University of Göttingen, Germany. They also acknowledge the funding by the German Research Association (DFG) via Research Training Group 1644 “Scaling Problems in Statistics”; the last author acknowledges financial support from the Swiss National Science Foundation, project 200021-192345.

Appendix

Asymptotic scenarios beyond (A1)

Although the results are derived under (A1), they are not restricted to such an asymptotic scenario. Under the marginal law, $\mbox{E}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})=\mathbf{0}_{m}$ , but under (A1), $\mbox{E}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}|\mathbf{v})\nrightarrow\mathbf{0}_{m}$ under the conditional law. If (A1) were relaxed so that both $m\rightarrow\infty$ and all $n_{i}\rightarrow\infty$ , the EBLUP would be consistent under both probability measures, and the marginal and conditional cases would collapse into one. To investigate the effect of unbounded $n_{i}$ on Theorems 1-3, the source of the error terms becomes crucial. The error term in Theorem 1 is due to $\hat{\operatorname{\boldsymbol{\delta}}}=\operatorname{\boldsymbol{\delta}}+\{O(m^{-1/2})\}_{r}$ while the one in Theorem 2 is due to $\hat{\operatorname{\boldsymbol{\delta}}}=\operatorname{\boldsymbol{\delta}}^{v}+\{O(m^{-1/2})\}_{r}$ and the estimation of $\lambda$ . As the EBLUP is consistent under the conditional law if $n_{i}\rightarrow\infty$ for all $i$ , the non-centrality parameter vanishes in such cases, i.e., $\lambda\rightarrow 0$ . The same holds for Theorem 3. Technically, the cases for unbounded $n_{i}$ differ from (A1) as the leading entries on the diagonal of $\operatorname{\boldsymbol{\Sigma}}$ and $\operatorname{\boldsymbol{\Sigma}}_{v}$ vanish: $(\operatorname{\boldsymbol{\Sigma}})_{ii}=O(n_{i}^{-1})$ and $(\operatorname{\boldsymbol{\Sigma}}_{v})_{ii}=O(n_{i}^{-1})$ . In order to assess which asymptotic behavior (of the diagonal entries or of $\hat{\operatorname{\boldsymbol{\delta}}}$ ) determines the rate of the error term in each theorem, one has to fix the relation of $m$ and $n_{i}$ , $i=1,\dots,m$ .

Under the asymptotic scenario $m\rightarrow\infty$ and all or some $n_{i}\rightarrow\infty$ , the stated results still hold, and the error rates can improve. This depends intricately on the number of unbounded cluster sample sizes and the rate at which they grow. If the number of clusters with bounded sample sizes is itself unbounded, that is $O(m)$ , the error rates generally fall back on what is stated in Theorems 1-3. If it is bounded, a toy example shows that they can improve. Set the sample size of a single cluster as fixed and let all other cluster sample sizes grow at the same rate as the number of clusters $m$ . That is, $n_{1}=O(1)$ , $m\rightarrow\infty$ and $m/n_{i}\rightarrow 1$ , for $i=2,\dots,m$ . Then, by Lemma 1 and the proof of Theorem 1, $m^{-1/2}\sum_{i=1}^{m}(\hat{\mu}_{i}-\mu_{i})^{2}=O_{p}(m^{-1/2})$ , so that the error in Theorem 1 is reduced to $O(m^{-3/2})$ .

Proofs

The proofs are given in two parts. First, the order of the bias of the covariance matrix estimator is established in Lemmas 1 and 2. The proof of the former is omitted for brevity, the proof of the latter is given in the Supplement. Both rely on Taylor approximations, similar to Prasad and Rao (1990) and Datta and Lahiri (2000). The difficulty in Lemma 2 lies in the nature of $\operatorname{\boldsymbol{\delta}}^{v}$ and decomposition (6): a multitude of additional terms has to be evaluated. In the second part of the proofs it is shown that the resulting error rate is preserved in the evaluations that lead to Theorems 1-3. Since both the dimension of the covariance matrix estimator and the error rate are given in terms of $m$ , this has to be carefully addressed in the matrix inversion, the quadratic form, and the final probabilistic statement.

Proof for Theorem 1

Proof.

Let $(\operatorname{\boldsymbol{\Sigma}})_{ik}=\sigma_{ik}$ , $\{\widehat{\operatorname{\boldsymbol{\Sigma}}}(\hat{\operatorname{\boldsymbol{\delta}}})\}_{ik}=\hat{\sigma}_{ik}$ and $\{\widehat{\operatorname{\boldsymbol{\Sigma}}}(\operatorname{\boldsymbol{\delta}})\}_{ik}=\tilde{\sigma}_{ik}$ . We first show that

[TABLE]

By Lemma 1, $\mbox{E}(\hat{\sigma}_{ik})=\sigma_{ik}+O_{p}(m^{-3/2})$ , as well as $\tilde{\sigma}_{ik}=\sigma_{ik}+O_{p}(m^{-3/2})$ . Note that $\hat{\delta}_{e}-\delta_{e}=O_{p}(m^{-1/2})$ . Further, $\tilde{\sigma}_{ii}=O(1)$ as well as $\tilde{\sigma}_{ik}=O(m^{-1})$ for $i\neq k$ and this order is preserved for its derivatives with respect to $\operatorname{\boldsymbol{\delta}}$ . Thus,

[TABLE]

By Chebyshev's inequality, a random variable $X$ with finite variance satisfies $X=\mbox{E}(X)+O_{p}\{\sqrt{\mbox{Var}(X)}\}$ . It follows that $\widehat{\operatorname{\boldsymbol{\Sigma}}}=\operatorname{\boldsymbol{\Sigma}}-\operatorname{\mathbf{C}}$ where

[TABLE]

It is now shown that inverting preserves the error. Let $\operatorname{\mathbf{D}}=(\operatorname{\mathbf{d}}_{1},\dots,\operatorname{\mathbf{d}}_{m})$ for $\operatorname{\mathbf{d}}_{i}$ as in (B2) and note that $(\operatorname{\mathbf{X}}^{t}\operatorname{\mathbf{V}}^{-1}\operatorname{\mathbf{X}})^{-1}=\{O_{p}(m^{-1})\}_{p\times p}$ . The matrix inversion formula yields

[TABLE]

Thus, $\operatorname{\mathbf{C}}\operatorname{\boldsymbol{\Sigma}}^{-1}=\mbox{diag}[\{O_{p}(m^{-1/2})\}_{m}]+\{O_{p}(m^{-3/2})\}_{m\times m}$ . Denote by $\lambda_{\operatorname{\mathbf{C}}\operatorname{\boldsymbol{\Sigma}}^{-1}}$ the largest eigenvalue of $\operatorname{\mathbf{C}}\operatorname{\boldsymbol{\Sigma}}^{-1}$ . With the column-sum norm, $\lambda_{\operatorname{\mathbf{C}}\operatorname{\boldsymbol{\Sigma}}^{-1}}\leq\max_{k=1,\dots,m}\sum_{i=1}^{m}|\{\operatorname{\mathbf{C}}\operatorname{\boldsymbol{\Sigma}}^{-1}\}_{ik}|=O(m^{-1/2})<1$ for large $m$ . Writing the inverse as a Neumann series, $(\operatorname{\mathbf{I}}_{m}-\operatorname{\mathbf{C}}\operatorname{\boldsymbol{\Sigma}}^{-1})^{-1}=\operatorname{\mathbf{I}}_{m}+\mbox{diag}[\{O_{p}(m^{-1/2})\}_{m}]+\{O_{p}(m^{-3/2})\}_{m\times m}$ . Now

[TABLE]

Finally, since $m^{-1/2}\sum_{i=1}^{m}(\hat{\mu}_{i}-\mu_{i})^{2}=O_{p}(m^{1/2})$ and $Q=\|\operatorname{\boldsymbol{\Sigma}}^{-1/2}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})\|^{2}/m=O_{p}(1)$ , putting all parts together gives (13). Further, let $U=\|\widehat{\operatorname{\boldsymbol{\Sigma}}}^{-1/2}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})\|^{2}/m-\|\operatorname{\boldsymbol{\Sigma}}^{-1/2}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})\|^{2}/m=O_{p}(m^{-1/2})$ with probability density function $f_{U}$ and let $z=m^{-1}\chi^{2}_{m,1-\alpha}=O(1)$ , such that

[TABLE]

which concludes the proof. ∎
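Two standard steps used in the proof above can be spelled out. The following is a sketch in generic notation: the constants $M$ and $\varepsilon$ and the matrix $\mathbf{A}$ (standing in for $\mathbf{C}\boldsymbol{\Sigma}^{-1}$) are symbols introduced here for illustration, and $\|\cdot\|$ denotes any induced matrix norm, for instance the column-sum norm, which bounds the largest eigenvalue in modulus.

```latex
% Chebyshev's inequality: for X with finite variance and any M > 0,
\mathrm{P}\left\{\,|X-\mathrm{E}(X)| \geq M\sqrt{\mathrm{Var}(X)}\,\right\} \leq M^{-2},
\quad\text{so that } M=\varepsilon^{-1/2} \text{ yields }
X=\mathrm{E}(X)+O_{p}\{\sqrt{\mathrm{Var}(X)}\}.

% Neumann series: if \|A\| < 1, then
(\mathbf{I}_{m}-\mathbf{A})^{-1}=\sum_{k=0}^{\infty}\mathbf{A}^{k},
\qquad
\left\|(\mathbf{I}_{m}-\mathbf{A})^{-1}-\mathbf{I}_{m}\right\|
\leq \sum_{k=1}^{\infty}\|\mathbf{A}\|^{k}
= \frac{\|\mathbf{A}\|}{1-\|\mathbf{A}\|}.
```

If $\|\mathbf{A}\|$ is of order $m^{-1/2}$, as for $\mathbf{C}\boldsymbol{\Sigma}^{-1}$ above, the bound on the right is of the same order, which is the sense in which inversion preserves the error.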

Proof and Definitions for Theorem 2

First, define $\mathbf{w}_{i}=(\mathbf{b}_{i}^{t}\mathbf{Z}_{i}-\mathbf{h}_{i}^{t})\mathbf{J}_{i}+\mathbf{d}_{i}^{t}(\operatorname{\mathbf{X}}^{t}\operatorname{\mathbf{V}}^{-1}\operatorname{\mathbf{X}})^{-1}\operatorname{\mathbf{X}}^{t}\operatorname{\mathbf{V}}^{-1}\in\mathbb{R}^{n}$ , $n=\sum_{i=1}^{m}n_{i}$ , so that $\mathbf{w}_{i}^{t}\mathbf{e}=\tilde{\mu}_{i}-\mbox{E}(\tilde{\mu}_{i}|\mathbf{v})$ . Let ${\mathbf{L}}_{3}^{\ast}(\operatorname{\boldsymbol{\delta}}^{v})=\mbox{Cov}\left(\hat{\operatorname{\boldsymbol{\mu}}}-\tilde{\operatorname{\boldsymbol{\mu}}},\tilde{\operatorname{\boldsymbol{\mu}}}\big{|}\mathbf{v}\right)$ and $\widehat{\mathbf{L}}_{3}(\operatorname{\boldsymbol{\delta}}^{v})=\widehat{\mathbf{L}}_{3}^{\ast}(\operatorname{\boldsymbol{\delta}}^{v})+\widehat{\mathbf{L}}_{3}^{\ast}(\operatorname{\boldsymbol{\delta}}^{v})^{t}$ . If $\operatorname{\boldsymbol{\delta}}^{v}$ is estimated via

(i) REML, given $\mathbf{P}=\operatorname{\mathbf{V}}^{-1}-\operatorname{\mathbf{V}}^{-1}\operatorname{\mathbf{X}}(\operatorname{\mathbf{X}}^{t}\operatorname{\mathbf{V}}^{-1}\operatorname{\mathbf{X}})^{-1}\operatorname{\mathbf{X}}^{t}\operatorname{\mathbf{V}}^{-1}$ , then

[TABLE]

(ii) Henderson III, then

[TABLE]

Both estimators have entries of order $O(m^{-1})$ . They are derived in analogy to $\widehat{\mathbf{L}}_{4}(\operatorname{\boldsymbol{\delta}}^{v})$ , which is outlined in the Supplement.

Proof.

With Lemma 2 the proof for Theorem 1 can be replicated giving

[TABLE]

It thus remains to show that

[TABLE]

Examining the entries of both $\mathbf{A}$ and $\operatorname{\boldsymbol{\Sigma}}_{v}^{-1}$ gives, using the decomposition of the proof of Theorem 1, $\widehat{\operatorname{\boldsymbol{\Sigma}}}_{v}^{-1}=\operatorname{\boldsymbol{\Sigma}}_{v}^{-1}+\mathbf{B}$ , for $\mathbf{B}=\text{diag}[\{O(m^{-1/2})\}_{n_{i}\times n_{i}}]+\{O(m^{-3/2})\}_{n\times n}$ , that $\mathbf{A}^{t}\operatorname{\boldsymbol{\Sigma}}_{v}^{-1}\mathbf{A}=\text{diag}[\{O(1)\}_{n_{i}\times n_{i}}]+\{O(m^{-1})\}_{n\times n}$ , so that

[TABLE]

Using (C1) and (C2), so that $\hat{\operatorname{\boldsymbol{\delta}}}=\operatorname{\boldsymbol{\delta}}^{v}+\{O_{p}(m^{-1/2})\}_{r}$ as given by Lemma 4 in the Supplement and putting all parts together gives

[TABLE]

This error rate is sufficient as the estimator effectively contributes as $\hat{\lambda}/m$ in $\chi^{2}_{m}(\hat{\lambda})$ . Now we show that $\tilde{\lambda}=\tilde{\lambda}(\operatorname{\boldsymbol{\Sigma}}_{v},\tilde{\operatorname{\boldsymbol{\beta}}},\operatorname{\boldsymbol{\delta}}^{v})=\lambda+O_{p}(m^{1/2})$ by considering its expectation and variance.

[TABLE]

using (C1) and (C2). Similarly, $\text{Var}(\tilde{\lambda}|\operatorname{\mathbf{v}})=O(m)$ . Hence, $\tilde{\lambda}=\lambda+O_{p}(m^{1/2})$ . Finally,

[TABLE]

This concludes the proof. ∎
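To see why the non-centrality parameter cannot simply be ignored under the conditional law, the following Monte Carlo sketch draws from a $\chi^{2}_{m}(\lambda)$ distribution and records how often the draw falls below the central $\chi^{2}_{m,1-\alpha}$ quantile. It is a generic illustration, not the estimator $\hat{\lambda}$ of Theorem 2: the values of $m$ and $\lambda$ are arbitrary, the function names are hypothetical, and the central quantile is approximated by the Wilson-Hilferty formula to stay within the Python standard library.

```python
import random
from statistics import NormalDist


def chi2_quantile_wh(df: int, p: float) -> float:
    """Wilson-Hilferty approximation to the p-quantile of a central chi-square."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def coverage_with_central_quantile(m, lam, alpha=0.05, reps=5000, seed=1):
    """Share of chi-square_m(lam) draws below the central (1 - alpha) quantile."""
    rng = random.Random(seed)
    # The law of the statistic depends on the shifts only through their
    # squared sum lam, so the shift is spread evenly over the m coordinates.
    shift = (lam / m) ** 0.5
    cutoff = chi2_quantile_wh(m, 1.0 - alpha)
    hits = 0
    for _ in range(reps):
        stat = sum((rng.gauss(0.0, 1.0) + shift) ** 2 for _ in range(m))
        hits += stat < cutoff
    return hits / reps


if __name__ == "__main__":
    for lam in (0.0, 10.0):
        print(lam, coverage_with_central_quantile(m=20, lam=lam))
```

With $\lambda=0$ the empirical share should be close to the nominal level, whereas a positive $\lambda$ shifts the statistic upwards and the central quantile undercovers. This is the reason the conditional confidence set is calibrated with a non-central quantile.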

Proof for Theorem 3

Another way to obtain a pivot for multiple inference is to evaluate the distribution of the quadratic form $Q=\|\operatorname{\boldsymbol{\Sigma}}^{-1/2}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})\|^{2}$ under the conditional law. It follows a generalized non-central $\chi^{2}$ distribution and thus has no analytically tractable probability density function. However, due to the linearity of $\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}$ in $\mathbf{v}$ , the quadratic form $Q$ can be split into tractable terms.

Proof.

Due to linearity of $\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}$ , it holds that $\operatorname{\boldsymbol{\Sigma}}=\operatorname{\boldsymbol{\Sigma}}_{v}+\operatorname{\boldsymbol{\Sigma}}_{b}$ , where $\operatorname{\boldsymbol{\Sigma}}_{b}=\mbox{Cov}(\operatorname{\boldsymbol{\mu}}_{b})$ for $\operatorname{\boldsymbol{\mu}}_{b}=\mbox{E}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}}|\mathbf{v})$ by the law of total variance. Moreover,

[TABLE]

where $\mathbf{T}_{c}^{-1}$ fulfills $\operatorname{\boldsymbol{\Sigma}}_{v}\mathbf{T}_{c}^{-1}=\operatorname{\boldsymbol{\Sigma}}_{b}\operatorname{\boldsymbol{\Sigma}}^{-1}$ . Now consider $Q=S+R$ with

[TABLE]

It holds that $S|\mathbf{v}\sim\chi^{2}_{m}$ . Next, we show that $R$ is of lower order compared to $S$ . Let $\mathbf{W}\in\mathbb{R}^{m\times mq}$ such that $\operatorname{\boldsymbol{\mu}}_{b}=\mathbf{W}\operatorname{\mathbf{v}}$ . Note that $\mathbf{W}=\text{diag}[\{O(1)\}_{1\times q}]_{m}+\{O(m^{-1})\}_{m\times mq}$ .

[TABLE]

by the same reasoning as in the proof of Lemma 4 in the Supplement. Similarly, $\text{Var}(R|\operatorname{\mathbf{v}})=O(m)$ . Hence, $R=\mbox{E}(R|\operatorname{\mathbf{v}})+O_{p}\{\sqrt{\mbox{Var}(R|\operatorname{\mathbf{v}})}\}=O_{p}(m^{1/2})$ . Now, using that $S=O_{p}(m)$ ,

[TABLE]

Replacing $\operatorname{\boldsymbol{\Sigma}}$ in $Q$ by $\widehat{\operatorname{\boldsymbol{\Sigma}}}=\operatorname{\boldsymbol{\Sigma}}+\{O_{p}(m^{-1/2})\}_{m\times m}$ gives $\|\widehat{\operatorname{\boldsymbol{\Sigma}}}^{-1/2}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})\|^{2}/m=Q/m+O_{p}(m^{-1/2})$ as in the proof of Theorem 1. The order of the error coincides with the one in the equation above, which gives $\mbox{P}(\|\widehat{\operatorname{\boldsymbol{\Sigma}}}^{-1/2}(\hat{\operatorname{\boldsymbol{\mu}}}-\operatorname{\boldsymbol{\mu}})\|^{2}<\chi^{2}_{m,1-\alpha}|\mathbf{v})=1-\alpha+O(m^{-1/2})$ . ∎
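The statement of Theorem 3 can be illustrated in a deliberately simplified toy model. The sketch below assumes a balanced random-intercept model without fixed effects and with known variance components, so the predictor is the BLUP $\hat{\mu}_{i}=\gamma\bar{y}_{i}$ rather than the EBLUP. The random effects are held fixed at normal scores to mimic conditioning on $\mathbf{v}$. All parameter values and function names are assumptions made for illustration, and the $\chi^{2}$ quantile is approximated by the Wilson-Hilferty formula.

```python
import random
from statistics import NormalDist


def chi2_quantile_wh(df: int, p: float) -> float:
    """Wilson-Hilferty approximation to the p-quantile of a central chi-square."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def conditional_coverage(m=100, n=5, sigma_v2=1.0, sigma_e2=1.0,
                         alpha=0.05, reps=4000, seed=0):
    """Conditional coverage of the marginal joint set and of one single interval."""
    nd = NormalDist()
    rng = random.Random(seed)
    # Fixed random effects (normal scores): the conditioning event.
    v = [sigma_v2 ** 0.5 * nd.inv_cdf((i + 0.5) / m) for i in range(m)]
    gamma = sigma_v2 / (sigma_v2 + sigma_e2 / n)  # shrinkage factor of the BLUP
    g = (1.0 - gamma) * sigma_v2                  # marginal MSE of the BLUP
    sd_mean = (sigma_e2 / n) ** 0.5               # sd of a cluster mean of errors
    q_joint = chi2_quantile_wh(m, 1.0 - alpha)
    half_width = nd.inv_cdf(1.0 - alpha / 2.0) * g ** 0.5
    worst = max(range(m), key=lambda i: abs(v[i]))  # most extreme cluster
    joint_hits = 0
    single_hits = 0
    for _ in range(reps):
        # Prediction error given v: gamma * (v_i + mean error) - v_i.
        err = [gamma * (vi + rng.gauss(0.0, sd_mean)) - vi for vi in v]
        joint_hits += (sum(e * e for e in err) / g) < q_joint
        single_hits += abs(err[worst]) < half_width
    return joint_hits / reps, single_hits / reps


if __name__ == "__main__":
    joint, single = conditional_coverage()
    print("joint set:", joint, "single interval, extreme cluster:", single)
```

Under these assumptions the joint set built from the marginal covariance should cover at roughly the nominal level given $\mathbf{v}$, while the marginal interval for the cluster with the most extreme random effect falls clearly short. This mirrors the contrast between multiple and cluster-wise inference drawn in the Discussion.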

Bibliography


Battese, G. E., Harter, R. M., and Fuller, W. A. (1988). An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association, 83:28–36.
Chatterjee, S., Lahiri, P., and Li, H. (2008). Parametric Bootstrap Approximation to the Distribution of EBLUP and Related Prediction Intervals in Linear Mixed Models. The Annals of Statistics, 36(3):1221–1245.
Das, K., Jiang, J., and Rao, J. N. K. (2004). Mean Squared Error of Empirical Predictor. The Annals of Statistics, 32(2):828–840.
Datta, G. S., Ghosh, M., Smith, D. D., and Lahiri, P. (2002). On the Asymptotic Theory of Conditional and Unconditional Coverage Probabilities of Empirical Bayes Confidence Intervals. Scandinavian Journal of Statistics, 29:139–152.
Datta, G. S. and Lahiri, P. (2000). A Unified Measure of Uncertainty of Estimated Best Linear Predictors in Small Area Estimation Problems. Statistica Sinica, 10:613–627.
Demidenko, E. (2004). Mixed Models: Theory and Applications. Wiley Series in Probability and Statistics, Hoboken, NJ.
Francq, B. G., Lin, D., and Hoyer, W. (2019). Confidence, prediction, and tolerance in linear mixed models. Statistics in Medicine.
Ganesh, N. (2009). Simultaneous Credible Intervals for Small Area Estimation Problems. Journal of Multivariate Analysis, 100(8):1610–1621.
