Estimating the Random Effect in Big Data Mixed Models

Michael Law; Ya'acov Ritov

arXiv:1907.11958·math.ST·July 30, 2019

Estimating the Random Effect in Big Data Mixed Models

Michael Law, Ya'acov Ritov

PDF

Open Access

TL;DR

This paper develops new statistical methods for high-dimensional Gaussian linear mixed models, including tests, confidence intervals, and estimators for random effects, with applications to educational data analysis.

Contribution

It introduces an asymptotic F-test, confidence interval, and empirical Bayes estimator for random effects in high-dimensional mixed models, without fixed effect design assumptions.

Findings

01

The F-test asymptotically controls type I error.

02

The confidence interval achieves parametric rate $\

03

,

Abstract

We consider three problems in high-dimensional Gaussian linear mixed models. Without any assumptions on the design for the fixed effects, we construct an asymptotic $F$ -statistic for testing whether a collection of random effects is zero, derive an asymptotic confidence interval for a single random effect at the parametric rate $n$ , and propose an empirical Bayes estimator for a part of the mean vector in ANOVA type models that performs asymptotically as well as the oracle Bayes estimator. We support our results with numerical simulations and provide comparisons with oracle estimators. The procedures developed are applied to the Trends in International Mathematics and Sciences Study (TIMSS) data.

Tables4

Table 1. Table 1 : Simulations when σ ν 2 = 0 superscript subscript 𝜎 𝜈 2 0 \sigma_{\nu}^{2}=0 and σ γ 2 = 0 superscript subscript 𝜎 𝛾 2 0 \sigma_{\gamma}^{2}=0

	$ρ$	0	0	0	0	0.8	0.8	0.8	0.8
	$v$	25	25	50	50	25	25	50	50
	$r$	25	50	25	50	25	50	25	50
	$F_{LD}$	92	91	94	98	95	95	99	93
AveCov	$F_{LS}$	92	91	95	98	96	95	97	94
	$F_{EW}$	91	92	95	84	93	100	82	99
	${\hat{σ}}_{ν, LD}^{2}$	97	99	99	100	98	96	99	97
AveCov	${\hat{σ}}_{ν, LS}^{2}$	100	100	98	98	100	100	99	99
	${\hat{σ}}_{ν}^{2}$	100	100	100	99	100	100	97	100
	${\hat{σ}}_{ν, LD}^{2}$	0.33	0.32	0.31	0.34	0.33	0.31	0.36	0.35
AveLen	${\hat{σ}}_{ν, LS}^{2}$	0.12	0.16	0.26	0.43	0.13	0.15	0.25	0.35
	${\hat{σ}}_{ν}^{2}$	0.12	0.17	0.29	0.84	0.12	0.13	0.26	0.34
	${\hat{η}}_{oracle}$	0	0	0	0	0	0	0	0
AveLoss	${\hat{η}}_{LS}$	0.02	0.02	0.03	0.03	0.02	0.02	0.02	0.03
	${\hat{η}}_{EW}$	0.07	0.15	0.23	1.00	0.32	0.32	0.32	0.39

Table 2. Table 2 : Simulations when σ ν 2 = 0 superscript subscript 𝜎 𝜈 2 0 \sigma_{\nu}^{2}=0 and σ γ 2 = 1 superscript subscript 𝜎 𝛾 2 1 \sigma_{\gamma}^{2}=1

	$ρ$	0	0	0	0	0.8	0.8	0.8	0.8
	$v$	25	25	50	50	25	25	50	50
	$r$	25	50	25	50	25	50	25	50
	$F_{LD}$	95	96	96	94	94	93	95	97
AveCov	$F_{LS}$	96	96	97	94	95	92	97	98
	$F_{EW}$	95	98	95	97	93	99	99	99
	${\hat{σ}}_{ν, LD}^{2}$	97	99	100	96	97	96	96	99
AveCov	${\hat{σ}}_{ν, LS}^{2}$	100	100	100	96	100	100	100	100
	${\hat{σ}}_{ν}^{2}$	100	100	99	98	100	100	99	99
	${\hat{σ}}_{ν, LD}^{2}$	0.34	0.35	0.4	0.4	0.34	0.38	0.39	0.39
AveLen	${\hat{σ}}_{ν, LS}^{2}$	0.20	0.16	0.22	0.32	0.12	0.16	0.30	0.36
	${\hat{σ}}_{ν}^{2}$	0.19	0.17	0.22	0.55	0.12	0.14	0.28	0.35
	${\hat{η}}_{oracle}$	0	0	0	0	0	0	0	0
AveLoss	${\hat{η}}_{LS}$	0.04	0.03	0.04	0.04	0.04	0.04	0.04	0.05
	${\hat{η}}_{EW}$	0.12	0.28	0.19	1.07	0.41	0.39	0.39	0.38

Table 3. Table 3 : Simulations when σ ν 2 = 1 superscript subscript 𝜎 𝜈 2 1 \sigma_{\nu}^{2}=1 and σ γ 2 = 0 superscript subscript 𝜎 𝛾 2 0 \sigma_{\gamma}^{2}=0

	$ρ$	0	0	0	0	0.8	0.8	0.8	0.8
	$v$	25	25	50	50	25	25	50	50
	$r$	25	50	25	50	25	50	25	50
	$F_{LD}$	100	100	100	100	100	100	100	100
AveCov	$F_{LS}$	100	100	100	100	100	100	100	100
	$F_{EW}$	100	100	98	97	100	100	100	100
	${\hat{σ}}_{ν, LD}^{2}$	96	95	96	95	95	94	96	96
AveCov	${\hat{σ}}_{ν, LS}^{2}$	87	92	93	91	91	85	94	96
	${\hat{σ}}_{ν}^{2}$	80	87	89	95	62	42	51	73
	${\hat{σ}}_{ν, LD}^{2}$	0.64	0.66	0.52	0.52	0.67	0.63	0.52	0.53
AveLen	${\hat{σ}}_{ν, LS}^{2}$	1.22	1.34	1.12	1.31	1.32	1.19	1.05	1.27
	${\hat{σ}}_{ν}^{2}$	1.10	1.26	1.18	1.70	0.92	0.83	0.82	1.02
	${\hat{η}}_{oracle}$	0.11	0.12	0.19	0.19	0.11	0.11	0.19	0.19
AveLoss	${\hat{η}}_{LS}$	0.14	0.14	0.22	0.22	0.14	0.13	0.22	0.21
	${\hat{η}}_{EW}$	0.25	0.54	0.70	1.07	0.41	0.41	0.51	0.44

Table 4. Table 4 : Simulations when σ ν 2 = 1 superscript subscript 𝜎 𝜈 2 1 \sigma_{\nu}^{2}=1 and σ γ 2 = 1 superscript subscript 𝜎 𝛾 2 1 \sigma_{\gamma}^{2}=1

	$ρ$	0	0	0	0	0.8	0.8	0.8	0.8
	$v$	25	25	50	50	25	25	50	50
	$r$	25	50	25	50	25	50	25	50
	$F_{LD}$	100	100	100	100	100	100	100	100
AveCov	$F_{LS}$	100	100	100	100	100	100	100	99
	$F_{EW}$	99	100	100	94	100	100	100	98
	${\hat{σ}}_{ν, LD}^{2}$	93	98	98	95	97	96	95	96
AveCov	${\hat{σ}}_{ν, LS}^{2}$	89	95	95	94	87	88	93	88
	${\hat{σ}}_{ν}^{2}$	81	82	87	79	59	43	66	57
	${\hat{σ}}_{ν, LD}^{2}$	0.68	0.68	0.69	0.55	0.68	0.68	0.56	0.57
AveLen	${\hat{σ}}_{ν, LS}^{2}$	1.31	1.32	1.15	1.12	1.22	1.28	1.13	1.19
	${\hat{σ}}_{ν}^{2}$	1.20	1.17	1.10	1.20	0.88	0.87	0.89	0.93
	${\hat{η}}_{oracle}$	0.13	0.14	0.22	0.24	0.14	0.14	0.22	0.23
AveLoss	${\hat{η}}_{LS}$	0.18	0.18	0.27	0.28	0.19	0.19	0.26	0.28
	${\hat{η}}_{EW}$	0.39	0.41	0.46	0.97	0.50	0.52	0.52	0.59

Equations354

Y = μ + Z ν + W γ + ε,

Y = μ + Z ν + W γ + ε,

Y = X β + Z ν + W γ + Δ + ε .

Y = X β + Z ν + W γ + Δ + ε .

S_{μ} ≜ {m \in M_{s} : P_{m}^{⊥} μ^{2} = o (k)}

S_{μ} ≜ {m \in M_{s} : P_{m}^{⊥} μ^{2} = o (k)}

H_{0} : λ_{max} (Ψ) = 0

H_{0} : λ_{max} (Ψ) = 0

F = \frac{A ~ Y ^{2} / n _{\tilde{a}}}{B ~ Y ^{2} / n _{\tilde{b}}} .

F = \frac{A ~ Y ^{2} / n _{\tilde{a}}}{B ~ Y ^{2} / n _{\tilde{b}}} .

\tilde{A} Y = \tilde{A} Z ν + \tilde{A} ε

\tilde{A} Y = \tilde{A} Z ν + \tilde{A} ε

\tilde{B} Y = \tilde{B} ε .

A Y = A X β + A Z ν + A ε,

A Y = A X β + A Z ν + A ε,

B Y = B X β + B ε,

Q ≜ (A B) .

Q ≜ (A B) .

(A Y B Y) = Q Y = QX β + QZ ν + Qε = (A X β + A Z ν + A ε B X β + B ε) .

(A Y B Y) = Q Y = QX β + QZ ν + Qε = (A X β + A Z ν + A ε B X β + B ε) .

(A Y - A X \hat{β} B Y - B X \hat{β})

(A Y - A X \hat{β} B Y - B X \hat{β})

= (A X β - A X \hat{β} B X β - B X \hat{β}) + (A Z ν + A ε B ε) .

w_{m, Q} ≜ \frac{exp ( - \frac{1}{α} Q Y - QX β ^ _{m, Q} ^{2} )}{\sum _{k \in M_{u}} exp ( - \frac{1}{α} Q Y - QX β ^ _{k, Q} ^{2} )},

w_{m, Q} ≜ \frac{exp ( - \frac{1}{α} Q Y - QX β ^ _{m, Q} ^{2} )}{\sum _{k \in M_{u}} exp ( - \frac{1}{α} Q Y - QX β ^ _{k, Q} ^{2} )},

\hat{β}_{EW, Q} ≜ m \in M_{u} \sum w_{m, Q} \hat{β}_{m, Q} .

\hat{β}_{EW, Q} ≜ m \in M_{u} \sum w_{m, Q} \hat{β}_{m, Q} .

F_{EW} ≜ \frac{A Y - A X β ^ _{EW, Q} ^{2} / n _{a}}{B Y - B X β ^ _{EW, Q} ^{2} / n _{b}} .

F_{EW} ≜ \frac{A Y - A X β ^ _{EW, Q} ^{2} / n _{a}}{B Y - B X β ^ _{EW, Q} ^{2} / n _{b}} .

ϕ_{δ} ≜ \mathbbm 1 (F_{EW} > F_{n_{a}, n_{b}, δ}) .

ϕ_{δ} ≜ \mathbbm 1 (F_{EW} > F_{n_{a}, n_{b}, δ}) .

ν \sim N_{v} (0_{v}, Ψ) .

ν \sim N_{v} (0_{v}, Ψ) .

n \to \infty lim inf (u - s) \geq 0,

n \to \infty lim inf (u - s) \geq 0,

n \to \infty lim \frac{u lo g ( p )}{n _{a} \land n _{b}} \to 0.

n \to \infty lim inf (u - s) \geq 0,

n \to \infty lim inf (u - s) \geq 0,

n \to \infty lim \frac{u lo g ( p )}{n} = 0,

n_{a} ≍ n_{b} ≍ n .

F_{EW} = F + o_{P} (1),

F_{EW} = F + o_{P} (1),

H_{0} : λ_{max} (A Z Ψ Z^{T} A^{T}) = 0

H_{0} : λ_{max} (A Z Ψ Z^{T} A^{T}) = 0

H_{0} : σ_{ν}^{2} = 0

H_{0} : σ_{ν}^{2} = 0

Y = μ + ξ,

Y = μ + ξ,

w_{m} ≜ \frac{exp ( - \frac{1}{α} Y - X β ^ _{m} ^{2} )}{\sum _{k \in M_{u}} exp ( - \frac{1}{α} Y - X β ^ _{k} ^{2} )},

w_{m} ≜ \frac{exp ( - \frac{1}{α} Y - X β ^ _{m} ^{2} )}{\sum _{k \in M_{u}} exp ( - \frac{1}{α} Y - X β ^ _{k} ^{2} )},

\frac{1}{n} E m \in M_{u} \sum w_{m} X \hat{β}_{m} - μ^{2} \to 0.

\frac{1}{n} E m \in M_{u} \sum w_{m} X \hat{β}_{m} - μ^{2} \to 0.

\frac{1}{n} E m \in M_{u} \sum w_{m} X \hat{β}_{m} - μ^{2} \to 0.

\frac{1}{n} E m \in M_{u} \sum w_{m} X \hat{β}_{m} - μ^{2} \to 0.

A Y = A X β + A Z ν + A ε,

A Y = A X β + A Z ν + A ε,

B Y = B X β + B ε,

D^{- 1/2} V^{T} A Y = D^{- 1/2} V^{T} A X β + D^{- 1/2} V^{T} A Z ν + D^{- 1/2} V^{T} A ε .

D^{- 1/2} V^{T} A Y = D^{- 1/2} V^{T} A X β + D^{- 1/2} V^{T} A Z ν + D^{- 1/2} V^{T} A ε .

\overset{σ}{^}_{ε, Q}^{2} ≜ \frac{1}{n _{b}} B Y - B X \hat{β}_{EW, Q}^{2} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsStatistical Methods and Inference · Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models

Full text

Estimating the Random Effect in Big Data Mixed Models

Michael Law

Ya’acov Ritov Supported in part by NSF Grant DMS-1712962.

( University of Michigan

March 2, 2024)

Abstract

We consider three problems in high-dimensional Gaussian linear mixed models. Without any assumptions on the design for the fixed effects, we construct an asymptotic $F$ -statistic for testing whether a collection of random effects is zero, derive an asymptotic confidence interval for a single random effect at the parametric rate $\sqrt{n}$ , and propose an empirical Bayes estimator for a part of the mean vector in ANOVA type models that performs asymptotically as well as the oracle Bayes estimator. We support our results with numerical simulations and provide comparisons with oracle estimators. The procedures developed are applied to the Trends in International Mathematics and Sciences Study (TIMSS) data.

1 Introduction

In the past two decades, there has been a lot of progress in the theory for high-dimensional linear models. However, its close cousin, the high-dimensional linear mixed model, has received significantly less attention; it was not until the past decade until there were procedures for estimation. Consider a linear mixed model given by

[TABLE]

with $Z\in\mathbb{R}^{n\times v}$ , $W\in\mathbb{R}^{n\times r}$ , and $Y,\mu,\varepsilon\in\mathbb{R}^{n}$ . In addition, we observe covariates $X\in\mathbb{R}^{n\times p}$ such that $\Delta\triangleq\mu-X\beta$ is a small remainder term for some sparse vector $\beta\in\mathbb{R}^{p}$ (see Section 1.2 for a rigorous definition). Therefore, the model may be written as

[TABLE]

Here, $X$ is a design corresponding to the fixed effects and $(Z,W)$ to the random effects. We consider the setting where the random effects are low-dimensional, $v+r<n$ , but the fixed effects are high-dimensional, $p>n$ . We have separated the random effects in two to emphasize later that we are interested in $\nu$ and view $\gamma$ as nuisance parameters. Various authors have considered different aspects of this problem.

The earliest work of \citeAschelldorfer2011 proposed an estimator for both $\beta$ and the variance components using a lasso-type approach. These types of approaches were later extended by several authors who considered estimation with both convex, such as \citeAgroll2014, and non-convex penalties, such as \citeAwang2012. There has also a growing literature on model selection in high-dimensional linear mixed models (cf. \citeAmuller2013 and the references therein).

The problem of inference is slightly less well studied. To the best of our knowledge, \citeAchen2015 was the first to consider hypotheses testing problems for random effects in ANOVA type settings and \citeAbradic2017 for fixed effects in high-dimensional models. During the preparation of this manuscript, we became aware of the very recent work of \citeAli2019, who consider the problem of inference in high-dimensional linear mixed models. In particular, they discuss inference for fixed effects and estimation of variance components.

The goal of the present paper is to contribute to this growing literature on high-dimensional linear mixed models, both in terms of estimation and inference when the random effects and error are all Gaussian. In particular, we consider three related problems:

Testing whether $\nu\equiv 0$ . 2. 2.

Constructing confidence intervals for $\sigma_{\nu}^{2}$ when $\nu\sim\mathcal{N}_{v}\left(0_{v},\sigma_{\nu}^{2}I_{v}\right)$ . 3. 3.

Estimating using empirical Bayes in ANOVA Type Models.

Our methodology is inspired by both low-dimensional linear mixed models as well as high-dimensional linear models. Specifically, our approach to all three problems starts with considering a procedure in the corresponding low-dimensional problem and retrofitting it with tools and techniques from high-dimensional linear models to produce a procedure for the high-dimensional linear mixed model.

1.1 Organization of the Paper

We will end the current section with a description of the notation that we will adopt throughout the paper. Sections 2, 3, and 4 consider each of the three problems outlined in the Introduction respectively. Each one starts with a description of the problem setup, a brief motivation from the low-dimensional problem, a description of the estimator that will be considered, and ends with some theoretical results. In Section 5, we provide an overview of the computation of the estimators as well as simulations and a real data application. We conclude with discussions in Section 6. For the ease of presentation, we defer all proofs to Section 7.

1.2 Notation

Throughout, all of our variables will have a dependence on $n$ . However, when this will not cause confusion, we will suppress this dependence. Let $\left\|\cdot\right\|$ denote the standard (unscaled) Euclidean norm with the dimension of the space being implicit from the vector and $\left\|\cdot\right\|_{0}$ the $L_{0}$ -“norm” on $\mathbb{R}^{p}$ . We will also use $\left\|\cdot\right\|$ to denote the $\ell_{2}$ norm for matrices and $\left\|\cdot\right\|_{\text{HS}}$ to denote the Hilbert-Schmidt norm. Furthermore, $\lambda_{\text{max}}(\cdot)$ and $\lambda_{\text{min}}(\cdot)$ will denote the maximal and minimal eigenvalue respectively.

Consistent with other high-dimensional works, we will assume that $\beta$ is a sparse vector. There are various notions of sparsity but we will assume the slightly more general setting of weak sparsity. For $u\in\mathbb{N}$ , we will let $\mathcal{M}_{u}$ denote the collection of all models with the dimension of the fixed effects design equal to $u$ . For a model $m\in\mathcal{M}_{u}$ , $X_{m}$ will denote the $n\times u$ sub-matrix of $X$ corresponding to the columns indexed by $m$ . Moreover, $P_{m}$ will denote the projection onto $X_{m}$ and ${P}^{\perp}_{m}$ the projection onto the orthogonal complement. We will consider the following definition of weak sparsity.

Definition 1.

The vector $\mu$ is said to satisfy the weak sparsity property relative to $X$ with sparsity $s$ at rate $k$ if the set

[TABLE]

is non-empty. A set $S\in\mathcal{S}_{\mu}$ is said to be a weakly sparse set for the vector $\mu$ .

Then, we will let $S\in\mathcal{S}_{\mu}$ denote any weakly sparse set for $\mu$ .

2 A High-Dimensional $F$ -Test: Testing $\nu\equiv 0$

2.1 Model and Motivation

Consider the high-dimensional linear mixed model given in equation (1). For the fixed effects $X$ , we will make no assumption on the design besides independence with $(\nu,\gamma,\varepsilon)$ . For the random effects, we will similarly make no assumption on $(W,Z)$ , except for the fact that $(W,Z)$ is independent of $(\nu,\gamma,\varepsilon)$ . We will assume that $\varepsilon\sim\mathcal{N}_{n}(0_{n},\sigma_{\varepsilon}^{2}I_{n})$ for some positive constant $\sigma_{\varepsilon}^{2}>0$ and $\nu\sim\mathcal{N}_{v}(0_{v},\Psi)$ for some symmetric positive semi-definite matrix $\Psi$ . No assumption will be necessary on $\gamma$ as the nuisance parameters will be projected out in the first stage.

We are interested in the hypotheses testing problem

[TABLE]

Suppose temporarily that we are in the low-dimensional setting with $s=p$ , $p+v+r<n$ , and $\Delta\equiv 0$ . Then, in this low-dimensional problem, with $\nu$ and $\varepsilon$ both distributed as Gaussian, the standard procedure for testing whether $\nu\equiv 0$ is through the $F$ -test. Letting $\tilde{A}$ denote the projection onto $Z\ominus(X,W)$ with $n_{\tilde{a}}=\operatorname{rank}(\tilde{A})$ and $\tilde{B}$ denote the projection onto $(X,Z,W)^{\perp}$ with $n_{\tilde{b}}=\operatorname{rank}(\tilde{B})$ , we have that

[TABLE]

Then, under the null hypothesis, the above statistic has an $F_{n_{\tilde{a}},n_{\tilde{b}}}$ distribution. It is known that the $F$ -test has certain optimality properties for certain classes of hypotheses testing problems (cf \citeAjiang2007 and the references therein for more details).

The main obstacle to directly using this $F$ -test in the high-dimensional setting is removing the contribution of the fixed effects. One possibility is to perform model selection and choose the relevant covariates from $X$ and then use the low-dimensional $F$ -test. \citeAchen2015 consider a similar problem in the high-dimensional setting and they use a SCAD based approach for variable selection. As a consequence, they require $p=o(\sqrt{n})$ . Instead, we leverage the fact that a projection onto a particular space is a regression onto a design whose columns span that same space.

Expanding both the numerator and the denominator of the classical $F$ -statistic, we have that

[TABLE]

In both matrices above, they project onto the orthogonal complement of $W$ , which may still be achieved in the high-dimensional problem since $W$ is a low-dimensional matrix. We may find two matrices, $A$ and $B$ , such that

[TABLE]

where $A\in\mathbb{R}^{{n_{a}}\times n}$ and $B\in\mathbb{R}^{{n_{b}}\times n}$ have orthonormal rows that are mutually orthogonal (ie. $AB^{\mathsf{T}}=0_{n_{a}\times n_{b}}$ ). For simplicity, we will write

[TABLE]

Rewriting, we have that

[TABLE]

If $QX$ was low-dimensional, to obtain the projection of $QY$ onto the orthogonal complement of $QX$ , this is equivalent to finding the residuals of $QY$ using the covariates $QX$ ; this yields $QY-QX\hat{\beta}$ . Then, we have that

[TABLE]

Hence, this recasts the problem into one of high-dimensional prediction, for which there have been many procedures suggested to estimate $QX\beta$ , such as the lasso and exponential weighting (cf. \citeAtibshirani1996 and \citeAleung2006 respectively). Therefore, we propose using a plug-in estimator for $QX\beta$ using exponential weighting of all models of a particular size and then consider the resultant residuals. This idea, under mild assumptions, will provide an asymptotic $F$ -test.

2.2 Estimator

Rather than using exponential weighting for $AY$ and $BY$ separately, we will estimate the mean vector jointly; the advantages to this approach are twofold. On the one hand, it will significantly simplify our implementation as we will only need a single Markov chain (see Section 5) and, on the other, the estimation seems to perform better empirically since we pool more observations together. Therefore, we will fix a sequence of constants $u\in\mathbb{N}$ . Let $\hat{\beta}_{m,\text{Q}}$ denote the least-squares estimator of $\beta$ using the model $m\in\mathcal{M}_{u}$ with covariates $QX_{m}$ . We will define our set of exponential weights by

[TABLE]

where $\alpha\geq 6\lambda_{\text{max}}\left(QZZ^{\mathsf{T}}Q^{\mathsf{T}}+\sigma_{\varepsilon}^{2}I_{n_{a}+n_{b}}\right)$ . We will estimate $\beta$ by

[TABLE]

The bound on $\alpha$ is to ensure the consistency result of Lemma 3 under both the null and the alternative hypotheses. The subscript $Q$ is to emphasize that the estimator is constructed only using the $Q$ projected sub-data from (6). Then, the corresponding $F$ -statistic will be

[TABLE]

Similar to the classical $F$ -statistic, we will reject the null hypothesis for large values of $F_{\text{EW}}$ . In particular, for a value $\delta\in(0,1)$ , let $F_{n_{a},n_{b},\delta}$ denote the $\delta$ upper quantile of an $F_{n_{a},n_{b}}$ distribution. Then, we will consider tests of the form

[TABLE]

2.3 Assumptions

We will assume without the loss of generality that the columns of $X$ have squared norm that is $\mathcal{O}(n)$ . Moreover, we will assume that

(A1)

The vector $\nu$ satisfies

[TABLE]

Furthermore, $\lambda_{\text{max}}\left(QZ\Psi Z^{\mathsf{T}}Q^{\mathsf{T}}+\sigma_{\varepsilon}^{2}I_{n}\right)=\mathcal{O}(1)$ . 2. (A2)

The sequence of constants $u$ satisfies

[TABLE]

(A $2^{\ast}$ )

The sequence of constants $u$ and the number of observations in the reduced models, $n_{a}$ and $n_{b}$ , satisfy

[TABLE]

*Remark 1**.*

There are two assumptions regarding the sparsity and the number of random effects. The first assumption, (A2), is quite weak in terms of the sparsity level and is used to establish the distribution of the estimator under the null hypothesis. Considering the models in equations (4) and (5), which have $n_{a}$ and $n_{b}$ observations respectively, the sparsity only needs to be of smaller order than the number of observations in each of the two reduced models. This is known to be the optimal rate at which the mean vector can be estimated in a high-dimensional linear model (for example, cf. \citeAraskutti2011 or \citeArigollet2011). The requirement that $\liminf(u-s)\geq 0$ is only to ensure that the models we consider for exponential weighting are sufficiently large to remove the fixed effects.

The stronger assumption, (A $2^{\ast}$ ), is used when considering the power of the test under a specific contiguous alternative (see equation (7)), which gives the parametric rate of $\sqrt{n}$ . However, in cases where we have a different sequence of alternatives, the assumption may be weakened accordingly. Note that this is similar to the setting of the de-biased lasso, which requires a ultra sparsity to achieve the parametric rate in testing a single covariate (cf. \citeAcai2017).

2.4 Main Results for $F_{\text{EW}}$

We start by stating the asymptotic distribution of $F_{\text{EW}}$ under the null hypothesis.

Theorem 1.

Consider the linear mixed model given by equation (1) and the hypotheses testing problem from equation (2). Assume (A2). Under the null hypothesis and $\alpha\geq 4\sigma_{\varepsilon}^{2}$ ,

[TABLE]

where $F\sim F_{n_{a},n_{b}}$ .

This naturally leads to the question regarding the power of the testing procedure. In particular, between what contiguous alternatives can we distinguish? We will consider the following contiguous testing problem.

[TABLE]

*Remark 2**.*

Suppose that $\nu$ corresponds to a single random effect and the design is balanced, with $\nu\sim\mathcal{N}_{v}(0_{v},\sigma_{\nu}^{2}I_{v})$ . Then, the above hypotheses becomes

[TABLE]

which is a standard hypotheses testing problem, such as in the balanced one-way random effects model. In this model, in the low-dimensional setting, the rate of $\sqrt{n}$ is optimal.

Theorem 2.

Consider the linear mixed model given by equation (1) and the hypotheses testing problem from equation (7). Assume further (A1) and (A $2^{\ast}$ ). Fix a value of $\delta>0$ . Under the alternative hypothesis with $h>0$ sufficiently large (not depending on $n$ ) and $\alpha\geq 6\lambda_{\text{max}}\left(QZ\Psi Z^{\mathsf{T}}Q^{\mathsf{T}}+\sigma_{\varepsilon}^{2}I_{n_{a}+n_{b}}\right)$ , the sum of type I and type II error for the test statistic $\phi_{\delta}$ is between zero and one.

To prove the theorem, we will need the following lemma that provides consistency of $\hat{\beta}_{\text{EW},Q}$ under the setting of correlated errors, which may be of interest in its own right.

Lemma 3.

Consider a high-dimensional linear model given by

[TABLE]

with $\xi\sim\mathcal{N}_{n}(0_{n},\Sigma)$ for some covariance matrix $\Sigma$ satisfying $\lambda=\lambda_{\text{max}}(\Sigma)=\mathcal{O}(1)$ . Assume that $\mu$ is weakly sparse relative to $X$ with sparsity $s$ at rate $n$ and $\left\|X\beta\right\|^{2}=\mathcal{O}(n)$ . Assume further $\log(|\mathcal{M}_{u}|)=o(n)$ . Letting $\hat{\beta}_{m}$ denote the least-squares estimator for $\beta$ using the covariates $X_{m}$ , define the exponential weights as

[TABLE]

with $\alpha>6\lambda$ . Then,

[TABLE]

If in addition $\mu$ is weakly sparse relative to $X$ with sparsity $s$ at rate $\sqrt{n}$ and $\log(|\mathcal{M}_{u}|)=o(\sqrt{n})$ , then

[TABLE]

*Remark 3**.*

The above lemma slightly generalizes Theorem 5 of \citeAleung2006 to heteroskedastic and dependent, but still Gaussian, errors. However, in the setting where the errors are independent and identically distributed, the result of \citeAleung2006 is stronger as their bound for $\alpha$ is smaller.

3 Confidence Intervals for $\sigma_{\nu}^{2}$

3.1 Model and Motivation

In the previous section, we considered the problem of testing a collections of random effects. However, it is often of interest to construct confidence intervals for the variance of a particular random effect. Suppose that $\Psi=\sigma_{\nu}^{2}I_{v}$ . In the low-dimensional setting, there have been many procedures suggested to construct confidence intervals, from likelihood based approaches to $F$ -test inversions (for example, see \citeAjiang2007 and the references therein for a non-exhaustive list).

We take an approach similar to variance estimation in linear models. Consider the two reduced models from Section 2, which are reproduced below for convenience:

[TABLE]

The second model is a standard high-dimensional linear model with $B\varepsilon\sim\mathcal{N}_{n_{b}}\left(0_{n_{b}},\sigma_{\varepsilon}^{2}I_{n_{b}}\right)$ . In the low-dimensional setting, $\sigma_{\varepsilon}^{2}$ would be estimated by mean squared error, which is computed through projecting $BY$ onto the orthogonal complement of $BX$ . Hence, we may achieve a similar result by viewing this as a prediction problem and regressing out $BX$ . For the other model, note that we have a high-dimensional linear model with noise $AZ\nu+A\varepsilon\sim\mathcal{N}_{n_{a}}\left(0_{n_{a}},\sigma_{\nu}^{2}AZZ^{\mathsf{T}}A^{\mathsf{T}}+\sigma_{\varepsilon}^{2}I_{n_{a}}\right)$ . Since the matrix $A$ and $Z$ are both observed, we may whiten the data with respect to $\nu$ by taking the Spectral Decomposition of $AZZ^{\mathsf{T}}A^{\mathsf{T}}=VDV^{\mathsf{T}}$ and left multiplying by $D^{-1/2}V^{\mathsf{T}}$ . Here, $D$ is diagonal and $V$ is unitary. This yields

[TABLE]

Then, the above can be viewed as a high-dimensional linear model with noise given by $D^{-1/2}V^{\mathsf{T}}AZ\nu+D^{-1/2}V^{\mathsf{T}}A\varepsilon\sim\mathcal{N}_{n_{a}}\left(0_{n_{a}},\sigma_{\nu}^{2}I_{n_{a}}+\sigma_{\varepsilon}^{2}D^{-1}\right)$ . Since $D^{-1}$ is observed and $\sigma_{\varepsilon}^{2}$ can be estimated, this will provide an estimator of $\sigma_{\nu}^{2}$ , which we may then use to construct a pivot.

3.2 Estimator

As outlined previously, we will need to construct two estimators, one for $\sigma_{\varepsilon}^{2}$ and one for $\sigma_{\nu}^{2}$ . Again, we will jointly estimate the mean vector $Q\mu$ . Therefore, we will start by defining

[TABLE]

Then, $\hat{\sigma}_{\nu,Q}^{2}$ will be given by

[TABLE]

3.3 Assumptions

In addition to the assumptions from Section 2, we will need two additional assumptions regarding the distribution of the values of the diagonal matrix $D$ .

(B1)

Letting $d=\operatorname{diag}(D)$ , the norms satisfy

[TABLE] 2. (B2)

The sample sizes $n_{a}$ and $n_{b}$ satisfy

[TABLE]

*Remark 4**.*

Assumption (B1) on the matrix $D$ is to ensure that the design is sufficiently well balanced and that $Z$ is distinct enough from $W$ . This assumption is satisfied in general settings such as relatively balanced ANOVA models.

Assumption (B2) is a technical regularity condition in view of the triangular array framework. By possibly reducing to a convergent sub-sequence, we may always satisfy assumption (B2).

3.4 Main Results for $\hat{\sigma}_{\nu}^{2}$

We will start by stating the asymptotic distribution of $\hat{\sigma}_{\nu}^{2}$ .

Theorem 4.

Consider the linear mixed model given by equation (1), with the reduced models given in equations (5) and (8). Assume further (A1) with $\Psi=\sigma_{\nu}^{2}I_{v}$ , (A $2^{\ast}$ ), (B1), and (B2). Then,

[TABLE]

From the previous theorem, we immediately have that the following is a $(1-\alpha)$ confidence interval for $\sigma_{\nu}^{2}$ .

[TABLE]

In general, the values of $\sigma_{a}^{2}$ , $\sigma_{b}^{2}$ , and $\bar{d}$ are unknown quantities and need to be estimated. The following proposition yields consistent estimators for each of those quantities.

Proposition 5.

Under the assumptions of Theorem 4,

[TABLE]

Therefore, we may plug in the values of $\hat{\sigma}_{a}^{2}$ , $\hat{\sigma}_{b}^{2}$ , and $\bar{d}$ into equation (9) to obtain an asymptotic confidence interval.

4 Empirical Bayes in ANOVA Type Models

The motivating example of this problem framework is in terms of the Rasch model, originally proposed by \citeArasch1960. The model that we consider is slightly different than the classical Rasch model in that we have Gaussian responses as opposed to binary responses.

As an example of this model, the data that we will consider in Section 5.3 is from the Trends in Mathematics and Sciences Study (TIMSS), an international study conducted every four years to measure fourth and eighth grade student achievement in mathematics and science. We will only consider data from the year 2015. Countries randomly sample a collection of nationally representative schools to take standardized examinations in both mathematics and science, with questions being either multiple choice or constructed response. Then, each student within schools takes only a subset of the questions on the exams but all questions are answered by some students in each school. In addition to recording student responses, the data also contains background covariates for schools. \citeAmartin2016 provides a more detailed description of the methods and procedures employed by TIMSS and more general information about TIMSS is available in \citeAmullis20years.

For our analysis, we will only consider multiple choice questions and analyze on the level of school rather than students. To construct a response variable for school, we compute the proportion of questions answered correctly by students in that school. Note that, unlike the classical Rasch model, we assume a linear model and, for all schools, we have answers for all questions. Thus, by the Central Limit Theorem, our response $Y$ will be approximately Gaussian. The fixed effects design $X$ will be the background covariates for the school and the random effects design $Z$ will be an indicator for the country, with $\nu$ corresponding to the unobserved variability of the countries. In this example, we do since we have averaged over questions, we do not have any nuisance random effects. The problem that we will consider in this section is ranking the countries based on mathematical ability and trying to estimate the average number of questions that any particular country will answer correctly. That is, we would like to estimate $\mu+Z\nu$ for all countries.

4.1 Model and Motivation

The general problem framework that we consider is for $K$ -factor ANOVA models. However, we will derive the results in the setting when $K=2$ . That is, we will consider the model

[TABLE]

We do not assume that the design is fully crossed in the random effects. The goal in the problem is to estimate a subset of the mean vector, $\eta\triangleq\mu+Z\nu$ , since we view the random effects $W$ as nuisance. However, as the sample size increases, the number of observations per group stays bounded. In the context of the motivating data example, each country still only answers a finite number of questions as we increase the sample size. Therefore, it is not possible to consistently estimate the entire vector $\eta$ . A standard approach in the low-dimensional setting would be to use an empirical Bayes estimator by placing a Gaussian prior on both $\nu$ and $\gamma$ (for example, see \citeAbrown2018), which would then transform the problem into a standard high-dimensional linear mixed model. Therefore, we will use a $\mathcal{N}_{v}\left(0_{v},\sigma_{\nu}^{2}I_{v}\right)$ and $\mathcal{N}_{r}\left(0_{r},\sigma_{\gamma}^{2}I_{r}\right)$ prior on $\nu$ and $\gamma$ respectively.

Since we will need to estimate both $\sigma_{\nu}^{2}$ and $\sigma_{\gamma}^{2}$ for the prior, our estimator for $\sigma_{\gamma}^{2}$ will be analogous to $\hat{\sigma}_{\nu}^{2}$ from Section 3. To this end, we will need an additional matrix $C$ such that

[TABLE]

Further, $C$ may be chosen such that $C$ has orthonormal rows that are mutually orthogonal with $(A^{\mathsf{T}},B^{\mathsf{T}})^{\mathsf{T}}$ . We will let $n_{c}$ denote the number of rows of $C$ . Similarly, we will need to consider the Spectral Decomposition of $CWW^{\mathsf{T}}C^{\mathsf{T}}=\Gamma\Lambda\Gamma^{\mathsf{T}}$ . Similarly, we have that $\Lambda$ is diagonal and $\Gamma$ is unitary. Then, we may take the transformation

[TABLE]

to whiten the data with respect to $\gamma$ .

4.2 Estimator

To estimate $\mu$ , we will define $\hat{\beta}_{\text{EW}}$ to be the exponentially weighted estimator for model (1), with $\hat{\mu}_{\text{EW}}\triangleq X\hat{\beta}_{\text{EW}}$ .

From Theorem 4, we immediately have that $\hat{\sigma}_{\nu,Q}^{2}\overset{\mathbb{P}}{\to}\bar{d}\sigma_{\nu}^{2}$ . This suggests the empirical prior $\mathcal{N}_{v}\left(0_{v},\hat{\sigma}_{\nu,Q}^{2}/\hat{d}\right)$ for $\nu$ . However, we will not use this estimator, but a slightly modified version by jointly estimating the vector $\mu$ using all of the observations. Before we define a modified version of $\hat{\sigma}_{\nu,Q}^{2}$ , we will first need to adapt our estimator of $\sigma_{\varepsilon}^{2}$ , which will be given by

[TABLE]

Then, we may define

[TABLE]

We may similarly define $\hat{\sigma}_{\gamma}^{2}$ as

[TABLE]

Then, under some assumptions, it will follow that $\hat{\sigma}_{\gamma}^{2}\overset{\mathbb{P}}{\to}\bar{\lambda}\sigma_{\gamma}^{2}$ analogous to Theorem 4 where $\bar{\lambda}$ is defined in assumption (C1). Hence, we will estimate $\sigma_{\nu}^{2}$ by $\hat{\sigma}_{\nu}^{2}/\hat{d}$ , $\sigma_{\gamma}^{2}$ by $\hat{\sigma}_{\gamma}^{2}/\hat{\lambda}$ , and $\sigma_{\varepsilon}^{2}$ by $\hat{\sigma}_{\varepsilon}^{2}$ . This will give the following empirical Bayes estimator for $\eta$ ,

[TABLE]

To compare our estimator, we will consider an oracle that has access to $\mu$ , $\sigma_{\nu}^{2}$ , $\sigma_{\gamma}^{2}$ , and $\sigma_{\varepsilon}^{2}$ . Then, this oracle will use the Bayes estimator for $\eta$ (see Lemma 7), given by

[TABLE]

4.3 Assumptions

Unlike the previous section, we will not need to establish the asymptotic distribution of $\hat{\sigma}_{\nu}^{2}$ , rather we only need the estimator to be consistent. Accordingly, we may weaken our assumptions to the following

(C1)

The matrices $D$ and $\Lambda$ satisfy

[TABLE] 2. (C2)

The sequence of constants $u$ satisfies

[TABLE] 3. (C3)

The design matrices satisfy

[TABLE]

*Remark 5**.*

Assumption (C1) is weaker than the combination of assumptions (B1) and (B2). This will ensure consistency in estimating $\sigma_{\nu}^{2}$ , $\sigma_{\gamma}^{2}$ , and $\sigma_{\varepsilon}^{2}$ .

The next assumption, (C2), is identical to assumption (A2) but with the additional constraint on $n_{c}$ to ensure consistent prediction of $\Lambda^{-1/2}\Gamma^{\mathsf{T}}C\mu$ in equation (10).

The last assumption, (C3), is equivalent to, in the context of the motivating example, each school only answering $\mathcal{O}(1)$ number of questions and each question being answered by $\mathcal{O}(1)$ number of schools.

4.4 Main Results for $\hat{\eta}_{\text{EW}}$

We will start this section by proving that $\hat{\sigma}_{\nu}^{2}$ , $\hat{\sigma}_{\gamma}^{2}$ , and $\hat{\sigma}_{\varepsilon}^{2}$ are all consistent estimators.

Proposition 6.

Consider the linear mixed model given by equation (1), with the reduced models given in equations (5), (8), and (10). Then, assuming (C1) and (C2),

[TABLE]

The following is a standard lemma regarding the empirical Bayes estimators in this problem setup, which we will prove for the sake of completeness.

Lemma 7.

For a fixed vector $\mu\in\mathbb{R}^{n}$ and fixed values $\sigma_{\nu}^{2}>0$ , $\sigma_{\gamma}^{2}>0$ , and $\sigma_{\varepsilon}^{2}>0$ , the Bayes estimator of $\eta$ is given by

[TABLE]

We conclude this section with the main result regarding $\hat{\eta}_{\text{EW}}$ , that the empirical Bayes estimator performs nearly as well as the oracle Bayes estimator $\hat{\eta}_{\text{oracle}}$ asymptotically.

Theorem 8.

Consider the linear mixed model given in (1). Assume (C1), (C2), and (C3). Then,

[TABLE]

5 Implementation and Simulations

5.1 Implementation

In this section, we describe how to compute all three of the proposed estimators. The computation can be split into two halves:

Obtaining the matrices $A$ , $B$ , $C$ , $D$ , $V$ , $\Lambda$ , and $\Gamma$ with which to project the data. 2. 2.

Computing the estimators via exponential weighting.

For the first part, the matrices $A$ , $B$ , and $C$ can be obtained by adding the $n$ standard basis vectors in $\mathbb{R}^{n}$ to $(W,Z)$ to form a basis for $\mathbb{R}^{n}$ and applying the modified Gram-Schmidt procedure. To compute $D$ and $V$ , we form the matrices $AZZ^{\mathsf{T}}A^{\mathsf{T}}$ and directly apply the eigendecomposition to obtain both the eigenvalues and the eigenvectors. An analogous procedure is used to compute $\Lambda$ and $\Gamma$ .

For the second part, it may be further sub-divided into three parts:

Determining a bound $\alpha$ . 2. 2.

Determining a value of $u$ . 3. 3.

Computing $|\mathcal{M}_{u}|={p\choose u}=\mathcal{O}(p^{u})$ models.

To determine $\alpha$ , we only need an upper bound on the maximal eigenvalue of the noise vector. This may simply be bounded by the variance of $Y$ by choosing $\alpha=4\left\|Y\right\|^{2}/n$ . Though this will choose a conservative value of $\alpha$ , asymptotically this will have no effect. For $u$ , we may following the suggestion of \citeArigollet2011 and apply the exponential screening prior. They say that a covariate $X_{j}$ is selected if $\hat{\beta}_{\text{ES,j}}>1/n$ , where $\hat{\beta}_{\text{ES,j}}$ is the $j$ ’th entry of the exponential screening estimator. Note that, after computing the matrix $B$ , the model $BY=BX\beta+B\varepsilon$ is a standard high-dimensional linear model with the same level of sparsity as the original model, which is in the framework of \citeArigollet2011. Alternatively, we may apply cross-validation to tune the sparsity level, though this is very computationally intensive.

Once both are selected, we still need to compute ${p\choose u}$ least-squares estimators. However, \citeArigollet2011 and \citeArigollet2012 proposed a Metropolis Hastings approach to approximating the exponential weighting estimator. We will refer the interested reader to \citeArigollet2012 for the details.

5.2 Simulations

5.2.1 Methods and Models

We will consider the linear mixed model given by

[TABLE]

with $n=200$ , $s=3$ , and $p=500$ . For simplicity, we will set $\mu=X\beta$ . Since no assumption is made on the fixed effects design $X$ , we will generate $X$ from a multivariate Gaussian $\mathcal{N}_{n}\left(0_{n},\Sigma\right)$ , where $\Sigma$ is an equi-correlation matrix. That is,

[TABLE]

for $\rho\in\{0,0.8\}$ . As our fixed effects design is perfectly symmetric, we will take the strongly sparse set $S$ to be the first $s$ components with $\beta_{j}=1$ for $j\in S$ . We will fix the value of $\sigma_{\varepsilon}^{2}=1$ . For simplicity, we will let $\nu$ and $\gamma$ each correspond to a single random effect that are independent of each other. Therefore, we will have that

[TABLE]

Finally, with regards to the random effects design, we will consider the setting where $(v,r)\in\{25,50\}^{2}$ . Since the number of observations will be less than the number of possible combinations, we will start by generating a fully crossed design and then down sample $n$ rows without replacement of the fully crossed design. To experiment with both the level and the power of our procedure, we will consider $(\sigma_{\nu}^{2},\sigma_{\gamma}^{2})\in\{0,1\}^{2}$ . Hence, we will consider each of the following settings

[TABLE]

For each setting, we will generate a single draw of the design for $(X,W,Z)$ . Then, we will draw $N=100$ trials of $(\nu,\gamma,\varepsilon)$ and compute each of our estimators on these trials. Whenever a variance estimate is negative, we will set the estimated value to zero. When using exponential weighting, we will estimate the sparsity using the exponential screening prior of \citeArigollet2011 on the model $BY=BX\beta+B\varepsilon$ . Moreover, we will use the value of $\alpha=4\left\|Y\right\|^{2}/n$ .

All of our simulations are conducted in R. For each of our three problems, we will compare our estimator with an oracle low-dimensional estimator as well as a low-dimensional version of our proposed high-dimensional statistic. For the oracle estimators, in the setting of the $F$ -test, we directly apply the classical low-dimensional $F$ -test that has access to the true sparse set $S$ , as given in equation (2.3) of \citeAjiang2007, which we will denote by $F_{\text{LD}}$ . For the confidence intervals, we fit the linear mixed models with the true sparse set $S$ using lmer and applying the confint function. We will denote this estimator as $\hat{\sigma}_{\nu,\text{LD}}^{2}$ . Finally, in the setting of estimation, we directly compute the oracle Bayes estimator $\hat{\eta}_{\text{oracle}}$ described in Section 4. Collectively, things subscripted with “LD” will denote low-dimensional estimators.

In addition to comparing with the low-dimensional estimators, we also construct low-dimensional versions of our proposed high-dimensional statistics. To do so, we use the exact same statistic as in the high-dimensional setting but replace exponential weighting with least-squares using the sparse set $S$ . Hence, all of these statistics will be subscripted by “LS” to denote the usage of least-squares. We make this comparison since all of our proposed statistics rely on two layers of asymptotics:

In the prediction of the mean vector via exponential weighting. 2. 2.

In the convergence once the residuals are obtained.

To differentiate between these two, we introduce an intermediate statistic that relies on least-squares, which we think of as low-dimensional versions of our statistics. For example, recalling that $\hat{\beta}_{S}$ is the least-squares estimator of $\beta$ using the covariates $X_{S}$ , we will also consider the statistic

[TABLE]

Similarly, we will consider $\hat{\sigma}_{\nu,\text{LS}}^{2}$ and $\hat{\sigma}_{\varepsilon,\text{LS}}^{2}$ given by

[TABLE]

Then, we will have that

[TABLE]

and

[TABLE]

To compare the three procedures, we will consider the following metrics

Average Coverage: The percentage of time the correct hypothesis is selected for $F$ -tests or the percentage of time the true value of $\sigma_{\nu}^{2}$ is in the confidence interval. 2. 2.

Average Length: The average length of the confidence interval, taken as the upper endpoint minus the lower endpoint. 3. 3.

Average Loss: The average squared Euclidean distance between the estimated vector $\hat{\eta}$ and the true vector $\eta$ divided by $n$ .

5.2.2 Results

Tables (1) - (4) provide the results of our simulations. We notice that, on average, $F_{\text{EW}}$ and $\hat{\sigma}_{\nu}^{2}$ are slightly less conservative than $F_{\text{LD}}$ and $\hat{\sigma}_{\nu,\text{LD}}^{2}$ respectively. However, as $v$ increases from $25$ to $50$ , we generally notice that type I error decreases and coverage increases, suggesting that the main problem that is asymptotic. In particular, even with the correct low-dimensional covariates, $F_{\text{LS}}$ and $\hat{\sigma}_{\nu,\text{LS}}^{2}$ still have higher type I error and lower coverage but does outperform both $F_{\text{EW}}$ and $\hat{\sigma}_{\nu}^{2}$ , as expected. However, the confidence intervals for $\hat{\sigma}_{\nu}^{2}$ seem to be overly optimistic when the design is more correlated and the true value of $\sigma_{\nu}^{2}=1$ . In regards to estimation of $\eta$ , the loss for $\hat{\eta}_{\text{EW}}$ is generally close to the oracle loss, as predicted by Theorem 8.

Moreover, Figures (1) and (2) show some simulated confidence intervals for $10$ trials. The fact that $F_{\text{LS}}$ and $F_{\text{EW}}$ occasionally have shorter confidence intervals than profile likelihood might be attributable to the more optimistic nature of those two procedures.

5.3 Real Data Application

Following in the motivating example of Section 4, we consider the Trends in International Mathematics and Sciences Study dataset, which is freely available online. To simplify our analysis, we only consider the mathematics questions. After filtering out for complete cases on background covariates, we are left with $146$ questions, $r=43$ unique countries, $p=106$ covariates, and $6808$ schools. Therefore, we had a total of $n=6808$ responses after averaging over the students and questions within the schools.

To demonstrate our methodology, we apply all three procedures on the data. When applying our methodology, we tune the sparsity using the exponential screening prior of \citeArigollet2011 and set $\alpha=4\left\|Y\right\|^{2}/n$ . The high-dimensional $F$ -test rejected the null hypothesis that $\sigma_{\nu}^{2}=0$ and a $95\%$ confidence interval for $\sigma_{\nu}^{2}$ is $(0.0020,0.0051)$ , which suggests that, even controlling for school background characteristics, the country of the school impacts mathematical ability. For the last part, we define a country’s background characteristics $X$ to be the arithmetic average of all the schools’ background characteristics within that country. Then, applying the empirical Bayes procedure, we rank the countries based on the predicted number of questions they would answer correctly. The top five countries in order from our analysis are South Korea, Hong Kong, Singapore, Chinese Taipei, and Japan. Up to some reordering, our results are mostly consistent with the report of \citeAmullis2016 based on individual student data, who had the same top five countries.

6 Discussions

In this paper, we considered three problems related to high-dimensional linear mixed models. Throughout, we used exponential weighting to perform high-dimensional prediction due to its nice theoretical properties that do not require any assumption on the design matrix. However, exponential weighting may be replaced with another procedure as long as we also satisfy the conditions for an oracle inequality for prediction, such as assuming compatibility for the lasso. The theory for all three problems will remain unchanged. In particular, the lasso has the advantage of being a convex optimization problem which is easily implemented by standard software.

The main drawback to our approach is the assumption that both the random effects and the error distributions are Gaussian, which is used to maintain independence after constructing various projections. However, there are many situations when this is not a reasonable assumption and other procedures will need to be developed. We view our contribution as a step towards these problems, similar to the development of the theory for linear mixed models in the low-dimensional setting.

7 Proofs

7.1 Proofs for Section 2

Proof of Theorem 1.

Under the null hypothesis, we have that

[TABLE]

which is a standard high-dimensional linear model. By assumption (A2) and Theorem 5 of \citeAleung2006,

[TABLE]

By the triangle inequality, we have that

[TABLE]

By equations (11) and (12),

[TABLE]

Squaring both sides yields

[TABLE]

Similarly, for the denominator,

[TABLE]

Noting that the matrices $A$ and $B$ have orthonormal rows and are mutually orthogonal, independence of $A\varepsilon$ and $B\varepsilon$ follows from fact that $\varepsilon$ is Gaussian. Dividing finishes the proof. ∎

Proof of Theorem 2.

Temporarily, we will start by considering the numerator for $F_{\text{EW}}$ . Let $Q\xi\triangleq QZ\nu+Q\varepsilon\sim\mathcal{N}_{v}\left(0_{v},QZ\Psi Z^{\mathsf{T}}Q^{\mathsf{T}}+\sigma_{\varepsilon}^{2}I_{v}\right)$ . By an identical line of reasoning as in the proof of Theorem 1, it follows that

[TABLE]

We would like to remark that the $o_{\mathbb{P}}(1/\sqrt{n})$ as opposed to $o_{\mathbb{P}}(1)$ in Theorem 1 comes from assumption (A $2^{\ast}$ ) and we have applied Lemma 3 of the present paper as opposed to Theorem 5 from \citeAleung2006. Similarly, for the denominator,

[TABLE]

Thus, this yields

[TABLE]

We would like to show that

[TABLE]

It suffices to show that

[TABLE]

We will start by providing an upper bound on $F_{n_{a},n_{b},\delta}$ . From Lemma 1 of \citeAlaurent2000, it follows that, for a $\chi^{2}_{d}$ random variable,

[TABLE]

Therefore, it follows that, for any $x,y>0$ ,

[TABLE]

By choosing $x,y>0$ such that

[TABLE]

then

[TABLE]

Let $a,b>0$ be constants that will be chosen later. Define the event $\mathscr{T}$ as

[TABLE]

Again, by Lemma 1 of \citeAlaurent2000,

[TABLE]

Now,

[TABLE]

From assumption (A $2^{\ast}$ ), we have that $n_{a}\asymp n_{b}\asymp n$ . Let

[TABLE]

Then, assuming that

[TABLE]

it will follow that

[TABLE]

which finishes the proof by tuning $a,b>0$ and $h$ sufficiently large appropriately. ∎

Proof of Lemma 3.

Let $P_{m}$ denote the projection onto $X_{m}$ and ${P}^{\perp}_{m}$ the projection onto the orthogonal complement. Let $r_{m}$ as

[TABLE]

We will choose an arbitrary sequence of weakly sparse sets $S\in\mathcal{S}_{\mu}$ satisfying

[TABLE]

Fix $t>0$ arbitrarily and define the set $\mathcal{A}_{t}$ as follows

[TABLE]

Looking at the squared norm, we will start with the convexity of the norm and the triangle inequality to obtain that

[TABLE]

We will consider each of the three terms separately. For the first term, it follows from the definition of $\mathcal{A}_{t}$ that

[TABLE]

For the second term, we have that

[TABLE]

Fix $m\in\mathcal{A}_{t}^{\mathsf{c}}$ temporarily. Then, we have that

[TABLE]

Therefore, by the generalized Hölder inequality,

[TABLE]

Computing each of the three Laplace transforms separately, since

[TABLE]

it follows that

[TABLE]

Similarly,

[TABLE]

For the other term, observe that

[TABLE]

where $\preceq$ denotes stochastic ordering. Hence,

[TABLE]

Since $\alpha>6\lambda$ and $m\in\mathcal{A}_{t}^{\mathsf{c}}$ , it will follow that

[TABLE]

By assumption,

[TABLE]

for some constant $C>0$ . Hence,

[TABLE]

since $\log\left(|\mathcal{M}_{u}|\right)=o(n)$ . This implies that

[TABLE]

For the last term, fix a value of $K>0$ large to be determined later. Define the event $\mathscr{T}$ as

[TABLE]

We will first provide an upper bound on $\mathbb{P}(\mathscr{T}^{\mathsf{c}})$ . Indeed,

[TABLE]

Let $\varepsilon\sim\mathcal{N}_{n}(0_{n},I_{n})$ and define $Q_{m}\triangleq\Sigma^{1/2}P_{m}\Sigma^{1/2}$ . Fixing $m\in\mathcal{M}_{u}$ , we have by the Hanson-Wright inequality (Theorem 1.1 of \citeArudelson2013) that

[TABLE]

for some universal constant $c>0$ . Note that

[TABLE]

Since $Q_{m}$ is a real symmetric matrix, it admits a Spectral Decomposition with maximal eigenvalue bounded by $\lambda$ and rank at most $s$ . Hence, this gives

[TABLE]

Directly computing the expectation,

[TABLE]

This gives

[TABLE]

This implies that

[TABLE]

Now,

[TABLE]

Computing each term separately, we have, by the definition of $\mathscr{T}$ , that

[TABLE]

For the other term, note that

[TABLE]

where the second inequality is due to Cauchy-Schwarz and the limit follows from equation (13), assuming that $K>0$ was chosen sufficiently large. Therefore, we have shown that

[TABLE]

Since $t>0$ was arbitrary, this proves the first claim. For the second claim, note that the proof is identical except for defining the set $\mathcal{A}_{t}$ as

[TABLE]

This finishes the proof. ∎

7.2 Proofs for Section 3

Proof of Theorem 4.

Let $\xi\triangleq D^{-1/2}V^{\mathsf{T}}AZ\nu+D^{-1/2}V^{\mathsf{T}}A\varepsilon$ . From the proof of Theorem 2, it is easy to see that

[TABLE]

Therefore, we immediately have that

[TABLE]

We will consider each random variable separately. Indeed, by the Central Limit Theorem,

[TABLE]

or, equivalently,

[TABLE]

For the other term, we will verify Lindeberg’s condition. Define

[TABLE]

Let $\left(\zeta_{k}\right)_{k=1}^{n_{a}}$ be a sequence of i.i.d. $\chi^{2}_{1}$ random variables. Then, fixing $\delta>0$ ,

[TABLE]

where $C$ is the square root of the fourth central moment of a $\chi^{2}_{1}$ random variable. The second inequality is an application of Cauchy-Schwarz and the limit follows from assumption (B1). Therefore, by the Lindeberg Central Limit Theorem,

[TABLE]

Equivalently,

[TABLE]

By the independence of $\xi$ and $B\varepsilon$ and assumption (B2), the desired result now follows by differencing. ∎

Proof of Proposition 5.

Indeed, by the Law of Large Numbers,

[TABLE]

Hence, it is clear that $\hat{\sigma}_{b}^{2}\overset{\mathbb{P}}{\to}\sigma_{b}^{2}$ . The fact that $\hat{d}\overset{\mathbb{P}}{\to}\bar{d}$ is assumption (B2). Finally, for $\hat{\sigma}_{a}^{2}$ , it follows from Theorem 4 that

[TABLE]

Now, define $\hat{s}_{n_{a}}$ as

[TABLE]

It suffices to show that

[TABLE]

where $s^{2}_{n_{a}}$ is defined in the proof of Theorem 4. Then, we have that

[TABLE]

We will consider each of the two terms separately. Applying Cauchy-Schwarz to the first term, it follows that

[TABLE]

For the second term, note that

[TABLE]

which finishes the proof. ∎

7.3 Proofs for Section 4

Proof of Proposition 6.

Indeed, the consistency of $\hat{\sigma}_{\varepsilon}^{2}$ is by the Law of Large Numbers since

[TABLE]

Let $\left(\zeta_{k}\right)_{k=1}^{n_{a}}$ be a sequence of i.i.d. $\chi^{2}_{1}$ random variables. For $\hat{\sigma}_{\nu}^{2}$ , we have that

[TABLE]

Considering each term separately, we have, by the Law of Large Numbers and assumption (C1),

[TABLE]

For the second term, by a direct variance calculation,

[TABLE]

where the limit uses assumption (C1). Hence,

[TABLE]

Combining everything together shows that

[TABLE]

The proof for $\hat{\sigma}_{\gamma}^{2}$ is analogous and is omitted, which finishes the proof. ∎

Proof of Lemma 7.

By independence, the joint distribution of $(\nu,\gamma,\varepsilon)$ is given by

[TABLE]

Therefore, the joint distribution of $Y$ and $\eta$ is given by

[TABLE]

By standard results on the conditional mean of $\eta|Y$ , it follows that

[TABLE]

which finishes the proof. ∎

Proof of Theorem 8.

From Proposition 6, it follows that

[TABLE]

where $\delta_{\nu}^{2}$ , $\delta_{\gamma}^{2}$ , and $\delta_{\varepsilon}^{2}$ are all $o_{\mathbb{P}}(1)$ . We will write $\delta_{\ast}^{2}\triangleq\max\{\delta_{\nu}^{2},\delta_{\gamma}^{2},\delta_{\varepsilon}^{2}\}$ . Define

[TABLE]

Note that $\Sigma$ is positive definite and $\Delta$ is invertible almost surely. Then, by the Matrix Inversion Lemma,

[TABLE]

Some algebra will yield

[TABLE]

where we have added and subtracted $\mu$ and applied the Matrix Inversion Lemma. From Lemma 7, it follows that

[TABLE]

Define $\xi\triangleq\hat{\eta}_{\text{EW}}-\hat{\eta}_{\text{oracle}}$ . Therefore,

[TABLE]

We will prove each of the two terms on the right hand side are $o_{\mathbb{P}}(1)$ . Before doing so, we will prove a few useful facts that will facilitate the remainder of the proof.

(I)

$\left\|ZZ^{\mathsf{T}}\right\|^{2}=\mathcal{O}(1)$ . 2. (II)

$\left\|Z\nu+W\gamma+\varepsilon\right\|^{2}=\mathcal{O}_{\mathbb{P}}(n)$ and $\left\|Z\nu\right\|^{2}=\mathcal{O}_{\mathbb{P}}(n)$ . 3. (III)

$\left\|\Delta\right\|^{2}=o_{\mathbb{P}}(1)$ and $\left\|\left(\Sigma+\Delta\right)^{-1}\right\|^{2}=\mathcal{O}_{\mathbb{P}}(1)$ .

(I) is immediate since $\Sigma$ is the sum of three positive semi-definite matrices, one of which being $\sigma_{\nu}^{2}ZZ^{\mathsf{T}}$ , and $\left\|\Sigma\right\|^{2}=\mathcal{O}(1)$ by assumption (C3). The first part of (II) follows from the fact that

[TABLE]

where $\preceq$ denotes stochastic ordering and $\lambda_{\text{max}}\left(\Sigma\right)=\mathcal{O}(1)$ by assumption (C3). The second part of (II) is identical except we apply (I) to bound the maximal eigenvalue. Finally, for (III), the first part follows from $\delta_{\ast}^{2}\overset{\mathbb{P}}{\to}0$ and $\left\|\Sigma\right\|=\mathcal{O}(1)$ . For the second part, note that

[TABLE]

Then, $\left\|\Sigma^{-1}\Delta\right\|\overset{\mathbb{P}}{\to}0$ . Therefore,

[TABLE]

This proves all three claims. We can now show that $\left\|\xi\right\|^{2}=o_{\mathbb{P}}(n)$ , which we will show in parts. To that end, note that

[TABLE]

In the last line, we have used (I) and (III) from above and Lemma 3. Similarly,

[TABLE]

The last line follows from the fact that $\delta_{\nu}^{2}\overset{\mathbb{P}}{\to}0$ and (I), (II), and (III) from above. For the last term, we will apply the Matrix Inversion Lemma again to obtain

[TABLE]

Hence,

[TABLE]

where (I), (II), and (III) have all been used from above. Combining these three results with the triangle inequality, this proves that

[TABLE]

For the other quantity, we have that

[TABLE]

Applying all three facts from above demonstrates that

[TABLE]

Using the Cauchy-Schwarz inequality yields

[TABLE]

which finishes the proof. ∎

Bibliography21

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Bradic et al. (2017) Bradic, J., Claeskens, G., & Gueuning, T. (2017). Fixed effects testing in high-dimensional linear mixed models. ar Xiv preprint ar Xiv:1708.04887 .
2Brown et al. (2018) Brown, L. D., Mukherjee, G., & Weinstein, A. (2018). Empirical bayes estimates for a two-way cross-classified model. The Annals of Statistics , 46 (4), 1693–1720.
3Cai & Guo (2017) Cai, T. T. & Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of statistics , 45 (2), 615–646.
4Chen et al. (2015) Chen, F., Li, Z., Shi, L., & Zhu, L. (2015). Inference for mixed models of anova type with high-dimensional data. Journal of Multivariate Analysis , 133 , 382–401.
5Groll & Tutz (2014) Groll, A. & Tutz, G. (2014). Variable selection for generalized linear mixed models by l 1-penalized estimation. Statistics and Computing , 24 (2), 137–154.
6Jiang (2007) Jiang, J. (2007). Linear and generalized linear mixed models and their applications . Springer Science & Business Media.
7Laurent & Massart (2000) Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 1302–1338.
8Leung & Barron (2006) Leung, G. & Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory , 52 (8), 3396–3410.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Estimating the Random Effect in Big Data Mixed Models

Abstract

1 Introduction

1.1 Organization of the Paper

1.2 Notation

Definition 1**.**

2 A High-Dimensional FFF-Test: Testing ν≡0\nu\equiv 0ν≡0

2.1 Model and Motivation

2.2 Estimator

2.3 Assumptions

Remark 1*.*

2.4 Main Results for FEWF_{\text{EW}}FEW​

Theorem 1**.**

Remark 2*.*

Theorem 2**.**

Lemma 3**.**

Remark 3*.*

3 Confidence Intervals for σν2\sigma_{\nu}^{2}σν2​

3.1 Model and Motivation

3.2 Estimator

3.3 Assumptions

Remark 4*.*

3.4 Main Results for σ^ν2\hat{\sigma}_{\nu}^{2}σ^ν2​

Theorem 4**.**

Proposition 5**.**

4 Empirical Bayes in ANOVA Type Models

4.1 Model and Motivation

4.2 Estimator

4.3 Assumptions

Remark 5*.*

4.4 Main Results for η^EW\hat{\eta}_{\text{EW}}η^​EW​

Proposition 6**.**

Lemma 7**.**

Theorem 8**.**

5 Implementation and Simulations

5.1 Implementation

5.2 Simulations

5.2.1 Methods and Models

5.2.2 Results

5.3 Real Data Application

6 Discussions

7 Proofs

7.1 Proofs for Section 2

Proof of Theorem 1.

Proof of Theorem 2.

Proof of Lemma 3.

7.2 Proofs for Section 3

Proof of Theorem 4.

Proof of Proposition 5.

7.3 Proofs for Section 4

Proof of Proposition 6.

Proof of Lemma 7.

Proof of Theorem 8.

Definition 1.

2 A High-Dimensional $F$ -Test: Testing $\nu\equiv 0$

*Remark 1**.*

2.4 Main Results for $F_{\text{EW}}$

Theorem 1.

*Remark 2**.*

Theorem 2.

Lemma 3.

*Remark 3**.*

3 Confidence Intervals for $\sigma_{\nu}^{2}$

*Remark 4**.*

3.4 Main Results for $\hat{\sigma}_{\nu}^{2}$

Theorem 4.

Proposition 5.

*Remark 5**.*

4.4 Main Results for $\hat{\eta}_{\text{EW}}$

Proposition 6.

Lemma 7.

Theorem 8.