Estimating the Random Effect in Big Data Mixed Models
Michael Law, Ya'acov Ritov

TL;DR
This paper develops new statistical methods for high-dimensional Gaussian linear mixed models, including tests, confidence intervals, and estimators for random effects, with applications to educational data analysis.
Contribution
It introduces an asymptotic F-test, confidence interval, and empirical Bayes estimator for random effects in high-dimensional mixed models, without fixed effect design assumptions.
Findings
The F-test asymptotically controls type I error.
The confidence interval achieves parametric rate $\
,
Abstract
We consider three problems in high-dimensional Gaussian linear mixed models. Without any assumptions on the design for the fixed effects, we construct an asymptotic -statistic for testing whether a collection of random effects is zero, derive an asymptotic confidence interval for a single random effect at the parametric rate , and propose an empirical Bayes estimator for a part of the mean vector in ANOVA type models that performs asymptotically as well as the oracle Bayes estimator. We support our results with numerical simulations and provide comparisons with oracle estimators. The procedures developed are applied to the Trends in International Mathematics and Sciences Study (TIMSS) data.
| 0 | 0 | 0 | 0 | 0.8 | 0.8 | 0.8 | 0.8 | ||
| 25 | 25 | 50 | 50 | 25 | 25 | 50 | 50 | ||
| 25 | 50 | 25 | 50 | 25 | 50 | 25 | 50 | ||
| 92 | 91 | 94 | 98 | 95 | 95 | 99 | 93 | ||
| AveCov | 92 | 91 | 95 | 98 | 96 | 95 | 97 | 94 | |
| 91 | 92 | 95 | 84 | 93 | 100 | 82 | 99 | ||
| 97 | 99 | 99 | 100 | 98 | 96 | 99 | 97 | ||
| AveCov | 100 | 100 | 98 | 98 | 100 | 100 | 99 | 99 | |
| 100 | 100 | 100 | 99 | 100 | 100 | 97 | 100 | ||
| 0.33 | 0.32 | 0.31 | 0.34 | 0.33 | 0.31 | 0.36 | 0.35 | ||
| AveLen | 0.12 | 0.16 | 0.26 | 0.43 | 0.13 | 0.15 | 0.25 | 0.35 | |
| 0.12 | 0.17 | 0.29 | 0.84 | 0.12 | 0.13 | 0.26 | 0.34 | ||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| AveLoss | 0.02 | 0.02 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.03 | |
| 0.07 | 0.15 | 0.23 | 1.00 | 0.32 | 0.32 | 0.32 | 0.39 |
| 0 | 0 | 0 | 0 | 0.8 | 0.8 | 0.8 | 0.8 | ||
| 25 | 25 | 50 | 50 | 25 | 25 | 50 | 50 | ||
| 25 | 50 | 25 | 50 | 25 | 50 | 25 | 50 | ||
| 95 | 96 | 96 | 94 | 94 | 93 | 95 | 97 | ||
| AveCov | 96 | 96 | 97 | 94 | 95 | 92 | 97 | 98 | |
| 95 | 98 | 95 | 97 | 93 | 99 | 99 | 99 | ||
| 97 | 99 | 100 | 96 | 97 | 96 | 96 | 99 | ||
| AveCov | 100 | 100 | 100 | 96 | 100 | 100 | 100 | 100 | |
| 100 | 100 | 99 | 98 | 100 | 100 | 99 | 99 | ||
| 0.34 | 0.35 | 0.4 | 0.4 | 0.34 | 0.38 | 0.39 | 0.39 | ||
| AveLen | 0.20 | 0.16 | 0.22 | 0.32 | 0.12 | 0.16 | 0.30 | 0.36 | |
| 0.19 | 0.17 | 0.22 | 0.55 | 0.12 | 0.14 | 0.28 | 0.35 | ||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| AveLoss | 0.04 | 0.03 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.05 | |
| 0.12 | 0.28 | 0.19 | 1.07 | 0.41 | 0.39 | 0.39 | 0.38 |
| 0 | 0 | 0 | 0 | 0.8 | 0.8 | 0.8 | 0.8 | ||
| 25 | 25 | 50 | 50 | 25 | 25 | 50 | 50 | ||
| 25 | 50 | 25 | 50 | 25 | 50 | 25 | 50 | ||
| 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||
| AveCov | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| 100 | 100 | 98 | 97 | 100 | 100 | 100 | 100 | ||
| 96 | 95 | 96 | 95 | 95 | 94 | 96 | 96 | ||
| AveCov | 87 | 92 | 93 | 91 | 91 | 85 | 94 | 96 | |
| 80 | 87 | 89 | 95 | 62 | 42 | 51 | 73 | ||
| 0.64 | 0.66 | 0.52 | 0.52 | 0.67 | 0.63 | 0.52 | 0.53 | ||
| AveLen | 1.22 | 1.34 | 1.12 | 1.31 | 1.32 | 1.19 | 1.05 | 1.27 | |
| 1.10 | 1.26 | 1.18 | 1.70 | 0.92 | 0.83 | 0.82 | 1.02 | ||
| 0.11 | 0.12 | 0.19 | 0.19 | 0.11 | 0.11 | 0.19 | 0.19 | ||
| AveLoss | 0.14 | 0.14 | 0.22 | 0.22 | 0.14 | 0.13 | 0.22 | 0.21 | |
| 0.25 | 0.54 | 0.70 | 1.07 | 0.41 | 0.41 | 0.51 | 0.44 |
| 0 | 0 | 0 | 0 | 0.8 | 0.8 | 0.8 | 0.8 | ||
| 25 | 25 | 50 | 50 | 25 | 25 | 50 | 50 | ||
| 25 | 50 | 25 | 50 | 25 | 50 | 25 | 50 | ||
| 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||
| AveCov | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
| 99 | 100 | 100 | 94 | 100 | 100 | 100 | 98 | ||
| 93 | 98 | 98 | 95 | 97 | 96 | 95 | 96 | ||
| AveCov | 89 | 95 | 95 | 94 | 87 | 88 | 93 | 88 | |
| 81 | 82 | 87 | 79 | 59 | 43 | 66 | 57 | ||
| 0.68 | 0.68 | 0.69 | 0.55 | 0.68 | 0.68 | 0.56 | 0.57 | ||
| AveLen | 1.31 | 1.32 | 1.15 | 1.12 | 1.22 | 1.28 | 1.13 | 1.19 | |
| 1.20 | 1.17 | 1.10 | 1.20 | 0.88 | 0.87 | 0.89 | 0.93 | ||
| 0.13 | 0.14 | 0.22 | 0.24 | 0.14 | 0.14 | 0.22 | 0.23 | ||
| AveLoss | 0.18 | 0.18 | 0.27 | 0.28 | 0.19 | 0.19 | 0.26 | 0.28 | |
| 0.39 | 0.41 | 0.46 | 0.97 | 0.50 | 0.52 | 0.52 | 0.59 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Inference · Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models
Estimating the Random Effect in Big Data Mixed Models
Michael Law
Ya’acov Ritov Supported in part by NSF Grant DMS-1712962.
( University of Michigan
March 2, 2024)
Abstract
We consider three problems in high-dimensional Gaussian linear mixed models. Without any assumptions on the design for the fixed effects, we construct an asymptotic -statistic for testing whether a collection of random effects is zero, derive an asymptotic confidence interval for a single random effect at the parametric rate , and propose an empirical Bayes estimator for a part of the mean vector in ANOVA type models that performs asymptotically as well as the oracle Bayes estimator. We support our results with numerical simulations and provide comparisons with oracle estimators. The procedures developed are applied to the Trends in International Mathematics and Sciences Study (TIMSS) data.
1 Introduction
In the past two decades, there has been a lot of progress in the theory for high-dimensional linear models. However, its close cousin, the high-dimensional linear mixed model, has received significantly less attention; it was not until the past decade until there were procedures for estimation. Consider a linear mixed model given by
[TABLE]
with , , and . In addition, we observe covariates such that is a small remainder term for some sparse vector (see Section 1.2 for a rigorous definition). Therefore, the model may be written as
[TABLE]
Here, is a design corresponding to the fixed effects and to the random effects. We consider the setting where the random effects are low-dimensional, , but the fixed effects are high-dimensional, . We have separated the random effects in two to emphasize later that we are interested in and view as nuisance parameters. Various authors have considered different aspects of this problem.
The earliest work of \citeAschelldorfer2011 proposed an estimator for both and the variance components using a lasso-type approach. These types of approaches were later extended by several authors who considered estimation with both convex, such as \citeAgroll2014, and non-convex penalties, such as \citeAwang2012. There has also a growing literature on model selection in high-dimensional linear mixed models (cf. \citeAmuller2013 and the references therein).
The problem of inference is slightly less well studied. To the best of our knowledge, \citeAchen2015 was the first to consider hypotheses testing problems for random effects in ANOVA type settings and \citeAbradic2017 for fixed effects in high-dimensional models. During the preparation of this manuscript, we became aware of the very recent work of \citeAli2019, who consider the problem of inference in high-dimensional linear mixed models. In particular, they discuss inference for fixed effects and estimation of variance components.
The goal of the present paper is to contribute to this growing literature on high-dimensional linear mixed models, both in terms of estimation and inference when the random effects and error are all Gaussian. In particular, we consider three related problems:
Testing whether . 2. 2.
Constructing confidence intervals for when . 3. 3.
Estimating using empirical Bayes in ANOVA Type Models.
Our methodology is inspired by both low-dimensional linear mixed models as well as high-dimensional linear models. Specifically, our approach to all three problems starts with considering a procedure in the corresponding low-dimensional problem and retrofitting it with tools and techniques from high-dimensional linear models to produce a procedure for the high-dimensional linear mixed model.
1.1 Organization of the Paper
We will end the current section with a description of the notation that we will adopt throughout the paper. Sections 2, 3, and 4 consider each of the three problems outlined in the Introduction respectively. Each one starts with a description of the problem setup, a brief motivation from the low-dimensional problem, a description of the estimator that will be considered, and ends with some theoretical results. In Section 5, we provide an overview of the computation of the estimators as well as simulations and a real data application. We conclude with discussions in Section 6. For the ease of presentation, we defer all proofs to Section 7.
1.2 Notation
Throughout, all of our variables will have a dependence on . However, when this will not cause confusion, we will suppress this dependence. Let denote the standard (unscaled) Euclidean norm with the dimension of the space being implicit from the vector and the -“norm” on . We will also use to denote the norm for matrices and to denote the Hilbert-Schmidt norm. Furthermore, and will denote the maximal and minimal eigenvalue respectively.
Consistent with other high-dimensional works, we will assume that is a sparse vector. There are various notions of sparsity but we will assume the slightly more general setting of weak sparsity. For , we will let denote the collection of all models with the dimension of the fixed effects design equal to . For a model , will denote the sub-matrix of corresponding to the columns indexed by . Moreover, will denote the projection onto and the projection onto the orthogonal complement. We will consider the following definition of weak sparsity.
Definition 1**.**
The vector is said to satisfy the weak sparsity property relative to with sparsity at rate if the set
[TABLE]
is non-empty. A set is said to be a weakly sparse set for the vector .
Then, we will let denote any weakly sparse set for .
2 A High-Dimensional -Test: Testing
2.1 Model and Motivation
Consider the high-dimensional linear mixed model given in equation (1). For the fixed effects , we will make no assumption on the design besides independence with . For the random effects, we will similarly make no assumption on , except for the fact that is independent of . We will assume that for some positive constant and for some symmetric positive semi-definite matrix . No assumption will be necessary on as the nuisance parameters will be projected out in the first stage.
We are interested in the hypotheses testing problem
[TABLE]
Suppose temporarily that we are in the low-dimensional setting with , , and . Then, in this low-dimensional problem, with and both distributed as Gaussian, the standard procedure for testing whether is through the -test. Letting denote the projection onto with and denote the projection onto with , we have that
[TABLE]
Then, under the null hypothesis, the above statistic has an distribution. It is known that the -test has certain optimality properties for certain classes of hypotheses testing problems (cf \citeAjiang2007 and the references therein for more details).
The main obstacle to directly using this -test in the high-dimensional setting is removing the contribution of the fixed effects. One possibility is to perform model selection and choose the relevant covariates from and then use the low-dimensional -test. \citeAchen2015 consider a similar problem in the high-dimensional setting and they use a SCAD based approach for variable selection. As a consequence, they require . Instead, we leverage the fact that a projection onto a particular space is a regression onto a design whose columns span that same space.
Expanding both the numerator and the denominator of the classical -statistic, we have that
[TABLE]
In both matrices above, they project onto the orthogonal complement of , which may still be achieved in the high-dimensional problem since is a low-dimensional matrix. We may find two matrices, and , such that
[TABLE]
where and have orthonormal rows that are mutually orthogonal (ie. ). For simplicity, we will write
[TABLE]
Rewriting, we have that
[TABLE]
If was low-dimensional, to obtain the projection of onto the orthogonal complement of , this is equivalent to finding the residuals of using the covariates ; this yields . Then, we have that
[TABLE]
Hence, this recasts the problem into one of high-dimensional prediction, for which there have been many procedures suggested to estimate , such as the lasso and exponential weighting (cf. \citeAtibshirani1996 and \citeAleung2006 respectively). Therefore, we propose using a plug-in estimator for using exponential weighting of all models of a particular size and then consider the resultant residuals. This idea, under mild assumptions, will provide an asymptotic -test.
2.2 Estimator
Rather than using exponential weighting for and separately, we will estimate the mean vector jointly; the advantages to this approach are twofold. On the one hand, it will significantly simplify our implementation as we will only need a single Markov chain (see Section 5) and, on the other, the estimation seems to perform better empirically since we pool more observations together. Therefore, we will fix a sequence of constants . Let denote the least-squares estimator of using the model with covariates . We will define our set of exponential weights by
[TABLE]
where . We will estimate by
[TABLE]
The bound on is to ensure the consistency result of Lemma 3 under both the null and the alternative hypotheses. The subscript is to emphasize that the estimator is constructed only using the projected sub-data from (6). Then, the corresponding -statistic will be
[TABLE]
Similar to the classical -statistic, we will reject the null hypothesis for large values of . In particular, for a value , let denote the upper quantile of an distribution. Then, we will consider tests of the form
[TABLE]
2.3 Assumptions
We will assume without the loss of generality that the columns of have squared norm that is . Moreover, we will assume that
- (A1)
The vector satisfies
[TABLE]
Furthermore, . 2. (A2)
The sequence of constants satisfies
[TABLE]
- (A)
The sequence of constants and the number of observations in the reduced models, and , satisfy
[TABLE]
Remark 1*.*
There are two assumptions regarding the sparsity and the number of random effects. The first assumption, (A2), is quite weak in terms of the sparsity level and is used to establish the distribution of the estimator under the null hypothesis. Considering the models in equations (4) and (5), which have and observations respectively, the sparsity only needs to be of smaller order than the number of observations in each of the two reduced models. This is known to be the optimal rate at which the mean vector can be estimated in a high-dimensional linear model (for example, cf. \citeAraskutti2011 or \citeArigollet2011). The requirement that is only to ensure that the models we consider for exponential weighting are sufficiently large to remove the fixed effects.
The stronger assumption, (A), is used when considering the power of the test under a specific contiguous alternative (see equation (7)), which gives the parametric rate of . However, in cases where we have a different sequence of alternatives, the assumption may be weakened accordingly. Note that this is similar to the setting of the de-biased lasso, which requires a ultra sparsity to achieve the parametric rate in testing a single covariate (cf. \citeAcai2017).
2.4 Main Results for
We start by stating the asymptotic distribution of under the null hypothesis.
Theorem 1**.**
Consider the linear mixed model given by equation (1) and the hypotheses testing problem from equation (2). Assume (A2). Under the null hypothesis and ,
[TABLE]
where .
This naturally leads to the question regarding the power of the testing procedure. In particular, between what contiguous alternatives can we distinguish? We will consider the following contiguous testing problem.
[TABLE]
Remark 2*.*
Suppose that corresponds to a single random effect and the design is balanced, with . Then, the above hypotheses becomes
[TABLE]
which is a standard hypotheses testing problem, such as in the balanced one-way random effects model. In this model, in the low-dimensional setting, the rate of is optimal.
Theorem 2**.**
Consider the linear mixed model given by equation (1) and the hypotheses testing problem from equation (7). Assume further (A1) and (A). Fix a value of . Under the alternative hypothesis with sufficiently large (not depending on ) and , the sum of type I and type II error for the test statistic is between zero and one.
To prove the theorem, we will need the following lemma that provides consistency of under the setting of correlated errors, which may be of interest in its own right.
Lemma 3**.**
Consider a high-dimensional linear model given by
[TABLE]
with for some covariance matrix satisfying . Assume that is weakly sparse relative to with sparsity at rate and . Assume further . Letting denote the least-squares estimator for using the covariates , define the exponential weights as
[TABLE]
with . Then,
[TABLE]
If in addition is weakly sparse relative to with sparsity at rate and , then
[TABLE]
Remark 3*.*
The above lemma slightly generalizes Theorem 5 of \citeAleung2006 to heteroskedastic and dependent, but still Gaussian, errors. However, in the setting where the errors are independent and identically distributed, the result of \citeAleung2006 is stronger as their bound for is smaller.
3 Confidence Intervals for
3.1 Model and Motivation
In the previous section, we considered the problem of testing a collections of random effects. However, it is often of interest to construct confidence intervals for the variance of a particular random effect. Suppose that . In the low-dimensional setting, there have been many procedures suggested to construct confidence intervals, from likelihood based approaches to -test inversions (for example, see \citeAjiang2007 and the references therein for a non-exhaustive list).
We take an approach similar to variance estimation in linear models. Consider the two reduced models from Section 2, which are reproduced below for convenience:
[TABLE]
The second model is a standard high-dimensional linear model with . In the low-dimensional setting, would be estimated by mean squared error, which is computed through projecting onto the orthogonal complement of . Hence, we may achieve a similar result by viewing this as a prediction problem and regressing out . For the other model, note that we have a high-dimensional linear model with noise . Since the matrix and are both observed, we may whiten the data with respect to by taking the Spectral Decomposition of and left multiplying by . Here, is diagonal and is unitary. This yields
[TABLE]
Then, the above can be viewed as a high-dimensional linear model with noise given by . Since is observed and can be estimated, this will provide an estimator of , which we may then use to construct a pivot.
3.2 Estimator
As outlined previously, we will need to construct two estimators, one for and one for . Again, we will jointly estimate the mean vector . Therefore, we will start by defining
[TABLE]
Then, will be given by
[TABLE]
3.3 Assumptions
In addition to the assumptions from Section 2, we will need two additional assumptions regarding the distribution of the values of the diagonal matrix .
- (B1)
Letting , the norms satisfy
[TABLE] 2. (B2)
The sample sizes and satisfy
[TABLE]
Remark 4*.*
Assumption (B1) on the matrix is to ensure that the design is sufficiently well balanced and that is distinct enough from . This assumption is satisfied in general settings such as relatively balanced ANOVA models.
Assumption (B2) is a technical regularity condition in view of the triangular array framework. By possibly reducing to a convergent sub-sequence, we may always satisfy assumption (B2).
3.4 Main Results for
We will start by stating the asymptotic distribution of .
Theorem 4**.**
Consider the linear mixed model given by equation (1), with the reduced models given in equations (5) and (8). Assume further (A1) with , (A), (B1), and (B2). Then,
[TABLE]
From the previous theorem, we immediately have that the following is a confidence interval for .
[TABLE]
In general, the values of , , and are unknown quantities and need to be estimated. The following proposition yields consistent estimators for each of those quantities.
Proposition 5**.**
Under the assumptions of Theorem 4,
[TABLE]
Therefore, we may plug in the values of , , and into equation (9) to obtain an asymptotic confidence interval.
4 Empirical Bayes in ANOVA Type Models
The motivating example of this problem framework is in terms of the Rasch model, originally proposed by \citeArasch1960. The model that we consider is slightly different than the classical Rasch model in that we have Gaussian responses as opposed to binary responses.
As an example of this model, the data that we will consider in Section 5.3 is from the Trends in Mathematics and Sciences Study (TIMSS), an international study conducted every four years to measure fourth and eighth grade student achievement in mathematics and science. We will only consider data from the year 2015. Countries randomly sample a collection of nationally representative schools to take standardized examinations in both mathematics and science, with questions being either multiple choice or constructed response. Then, each student within schools takes only a subset of the questions on the exams but all questions are answered by some students in each school. In addition to recording student responses, the data also contains background covariates for schools. \citeAmartin2016 provides a more detailed description of the methods and procedures employed by TIMSS and more general information about TIMSS is available in \citeAmullis20years.
For our analysis, we will only consider multiple choice questions and analyze on the level of school rather than students. To construct a response variable for school, we compute the proportion of questions answered correctly by students in that school. Note that, unlike the classical Rasch model, we assume a linear model and, for all schools, we have answers for all questions. Thus, by the Central Limit Theorem, our response will be approximately Gaussian. The fixed effects design will be the background covariates for the school and the random effects design will be an indicator for the country, with corresponding to the unobserved variability of the countries. In this example, we do since we have averaged over questions, we do not have any nuisance random effects. The problem that we will consider in this section is ranking the countries based on mathematical ability and trying to estimate the average number of questions that any particular country will answer correctly. That is, we would like to estimate for all countries.
4.1 Model and Motivation
The general problem framework that we consider is for -factor ANOVA models. However, we will derive the results in the setting when . That is, we will consider the model
[TABLE]
We do not assume that the design is fully crossed in the random effects. The goal in the problem is to estimate a subset of the mean vector, , since we view the random effects as nuisance. However, as the sample size increases, the number of observations per group stays bounded. In the context of the motivating data example, each country still only answers a finite number of questions as we increase the sample size. Therefore, it is not possible to consistently estimate the entire vector . A standard approach in the low-dimensional setting would be to use an empirical Bayes estimator by placing a Gaussian prior on both and (for example, see \citeAbrown2018), which would then transform the problem into a standard high-dimensional linear mixed model. Therefore, we will use a and prior on and respectively.
Since we will need to estimate both and for the prior, our estimator for will be analogous to from Section 3. To this end, we will need an additional matrix such that
[TABLE]
Further, may be chosen such that has orthonormal rows that are mutually orthogonal with . We will let denote the number of rows of . Similarly, we will need to consider the Spectral Decomposition of . Similarly, we have that is diagonal and is unitary. Then, we may take the transformation
[TABLE]
to whiten the data with respect to .
4.2 Estimator
To estimate , we will define to be the exponentially weighted estimator for model (1), with .
From Theorem 4, we immediately have that . This suggests the empirical prior for . However, we will not use this estimator, but a slightly modified version by jointly estimating the vector using all of the observations. Before we define a modified version of , we will first need to adapt our estimator of , which will be given by
[TABLE]
Then, we may define
[TABLE]
We may similarly define as
[TABLE]
Then, under some assumptions, it will follow that analogous to Theorem 4 where is defined in assumption (C1). Hence, we will estimate by , by , and by . This will give the following empirical Bayes estimator for ,
[TABLE]
To compare our estimator, we will consider an oracle that has access to , , , and . Then, this oracle will use the Bayes estimator for (see Lemma 7), given by
[TABLE]
4.3 Assumptions
Unlike the previous section, we will not need to establish the asymptotic distribution of , rather we only need the estimator to be consistent. Accordingly, we may weaken our assumptions to the following
- (C1)
The matrices and satisfy
[TABLE] 2. (C2)
The sequence of constants satisfies
[TABLE] 3. (C3)
The design matrices satisfy
[TABLE]
Remark 5*.*
Assumption (C1) is weaker than the combination of assumptions (B1) and (B2). This will ensure consistency in estimating , , and .
The next assumption, (C2), is identical to assumption (A2) but with the additional constraint on to ensure consistent prediction of in equation (10).
The last assumption, (C3), is equivalent to, in the context of the motivating example, each school only answering number of questions and each question being answered by number of schools.
4.4 Main Results for
We will start this section by proving that , , and are all consistent estimators.
Proposition 6**.**
Consider the linear mixed model given by equation (1), with the reduced models given in equations (5), (8), and (10). Then, assuming (C1) and (C2),
[TABLE]
The following is a standard lemma regarding the empirical Bayes estimators in this problem setup, which we will prove for the sake of completeness.
Lemma 7**.**
For a fixed vector and fixed values , , and , the Bayes estimator of is given by
[TABLE]
We conclude this section with the main result regarding , that the empirical Bayes estimator performs nearly as well as the oracle Bayes estimator asymptotically.
Theorem 8**.**
Consider the linear mixed model given in (1). Assume (C1), (C2), and (C3). Then,
[TABLE]
5 Implementation and Simulations
5.1 Implementation
In this section, we describe how to compute all three of the proposed estimators. The computation can be split into two halves:
Obtaining the matrices , , , , , , and with which to project the data. 2. 2.
Computing the estimators via exponential weighting.
For the first part, the matrices , , and can be obtained by adding the standard basis vectors in to to form a basis for and applying the modified Gram-Schmidt procedure. To compute and , we form the matrices and directly apply the eigendecomposition to obtain both the eigenvalues and the eigenvectors. An analogous procedure is used to compute and .
For the second part, it may be further sub-divided into three parts:
Determining a bound . 2. 2.
Determining a value of . 3. 3.
Computing models.
To determine , we only need an upper bound on the maximal eigenvalue of the noise vector. This may simply be bounded by the variance of by choosing . Though this will choose a conservative value of , asymptotically this will have no effect. For , we may following the suggestion of \citeArigollet2011 and apply the exponential screening prior. They say that a covariate is selected if , where is the ’th entry of the exponential screening estimator. Note that, after computing the matrix , the model is a standard high-dimensional linear model with the same level of sparsity as the original model, which is in the framework of \citeArigollet2011. Alternatively, we may apply cross-validation to tune the sparsity level, though this is very computationally intensive.
Once both are selected, we still need to compute least-squares estimators. However, \citeArigollet2011 and \citeArigollet2012 proposed a Metropolis Hastings approach to approximating the exponential weighting estimator. We will refer the interested reader to \citeArigollet2012 for the details.
5.2 Simulations
5.2.1 Methods and Models
We will consider the linear mixed model given by
[TABLE]
with , , and . For simplicity, we will set . Since no assumption is made on the fixed effects design , we will generate from a multivariate Gaussian , where is an equi-correlation matrix. That is,
[TABLE]
for . As our fixed effects design is perfectly symmetric, we will take the strongly sparse set to be the first components with for . We will fix the value of . For simplicity, we will let and each correspond to a single random effect that are independent of each other. Therefore, we will have that
[TABLE]
Finally, with regards to the random effects design, we will consider the setting where . Since the number of observations will be less than the number of possible combinations, we will start by generating a fully crossed design and then down sample rows without replacement of the fully crossed design. To experiment with both the level and the power of our procedure, we will consider . Hence, we will consider each of the following settings
[TABLE]
For each setting, we will generate a single draw of the design for . Then, we will draw trials of and compute each of our estimators on these trials. Whenever a variance estimate is negative, we will set the estimated value to zero. When using exponential weighting, we will estimate the sparsity using the exponential screening prior of \citeArigollet2011 on the model . Moreover, we will use the value of .
All of our simulations are conducted in R. For each of our three problems, we will compare our estimator with an oracle low-dimensional estimator as well as a low-dimensional version of our proposed high-dimensional statistic. For the oracle estimators, in the setting of the -test, we directly apply the classical low-dimensional -test that has access to the true sparse set , as given in equation (2.3) of \citeAjiang2007, which we will denote by . For the confidence intervals, we fit the linear mixed models with the true sparse set using lmer and applying the confint function. We will denote this estimator as . Finally, in the setting of estimation, we directly compute the oracle Bayes estimator described in Section 4. Collectively, things subscripted with “LD” will denote low-dimensional estimators.
In addition to comparing with the low-dimensional estimators, we also construct low-dimensional versions of our proposed high-dimensional statistics. To do so, we use the exact same statistic as in the high-dimensional setting but replace exponential weighting with least-squares using the sparse set . Hence, all of these statistics will be subscripted by “LS” to denote the usage of least-squares. We make this comparison since all of our proposed statistics rely on two layers of asymptotics:
In the prediction of the mean vector via exponential weighting. 2. 2.
In the convergence once the residuals are obtained.
To differentiate between these two, we introduce an intermediate statistic that relies on least-squares, which we think of as low-dimensional versions of our statistics. For example, recalling that is the least-squares estimator of using the covariates , we will also consider the statistic
[TABLE]
Similarly, we will consider and given by
[TABLE]
Then, we will have that
[TABLE]
and
[TABLE]
To compare the three procedures, we will consider the following metrics
Average Coverage: The percentage of time the correct hypothesis is selected for -tests or the percentage of time the true value of is in the confidence interval. 2. 2.
Average Length: The average length of the confidence interval, taken as the upper endpoint minus the lower endpoint. 3. 3.
Average Loss: The average squared Euclidean distance between the estimated vector and the true vector divided by .
5.2.2 Results
Tables (1) - (4) provide the results of our simulations. We notice that, on average, and are slightly less conservative than and respectively. However, as increases from to , we generally notice that type I error decreases and coverage increases, suggesting that the main problem that is asymptotic. In particular, even with the correct low-dimensional covariates, and still have higher type I error and lower coverage but does outperform both and , as expected. However, the confidence intervals for seem to be overly optimistic when the design is more correlated and the true value of . In regards to estimation of , the loss for is generally close to the oracle loss, as predicted by Theorem 8.
Moreover, Figures (1) and (2) show some simulated confidence intervals for trials. The fact that and occasionally have shorter confidence intervals than profile likelihood might be attributable to the more optimistic nature of those two procedures.
5.3 Real Data Application
Following in the motivating example of Section 4, we consider the Trends in International Mathematics and Sciences Study dataset, which is freely available online. To simplify our analysis, we only consider the mathematics questions. After filtering out for complete cases on background covariates, we are left with questions, unique countries, covariates, and schools. Therefore, we had a total of responses after averaging over the students and questions within the schools.
To demonstrate our methodology, we apply all three procedures on the data. When applying our methodology, we tune the sparsity using the exponential screening prior of \citeArigollet2011 and set . The high-dimensional -test rejected the null hypothesis that and a confidence interval for is , which suggests that, even controlling for school background characteristics, the country of the school impacts mathematical ability. For the last part, we define a country’s background characteristics to be the arithmetic average of all the schools’ background characteristics within that country. Then, applying the empirical Bayes procedure, we rank the countries based on the predicted number of questions they would answer correctly. The top five countries in order from our analysis are South Korea, Hong Kong, Singapore, Chinese Taipei, and Japan. Up to some reordering, our results are mostly consistent with the report of \citeAmullis2016 based on individual student data, who had the same top five countries.
6 Discussions
In this paper, we considered three problems related to high-dimensional linear mixed models. Throughout, we used exponential weighting to perform high-dimensional prediction due to its nice theoretical properties that do not require any assumption on the design matrix. However, exponential weighting may be replaced with another procedure as long as we also satisfy the conditions for an oracle inequality for prediction, such as assuming compatibility for the lasso. The theory for all three problems will remain unchanged. In particular, the lasso has the advantage of being a convex optimization problem which is easily implemented by standard software.
The main drawback to our approach is the assumption that both the random effects and the error distributions are Gaussian, which is used to maintain independence after constructing various projections. However, there are many situations when this is not a reasonable assumption and other procedures will need to be developed. We view our contribution as a step towards these problems, similar to the development of the theory for linear mixed models in the low-dimensional setting.
7 Proofs
7.1 Proofs for Section 2
Proof of Theorem 1.
Under the null hypothesis, we have that
[TABLE]
which is a standard high-dimensional linear model. By assumption (A2) and Theorem 5 of \citeAleung2006,
[TABLE]
By the triangle inequality, we have that
[TABLE]
[TABLE]
Squaring both sides yields
[TABLE]
Similarly, for the denominator,
[TABLE]
Noting that the matrices and have orthonormal rows and are mutually orthogonal, independence of and follows from fact that is Gaussian. Dividing finishes the proof. ∎
Proof of Theorem 2.
Temporarily, we will start by considering the numerator for . Let . By an identical line of reasoning as in the proof of Theorem 1, it follows that
[TABLE]
We would like to remark that the as opposed to in Theorem 1 comes from assumption (A) and we have applied Lemma 3 of the present paper as opposed to Theorem 5 from \citeAleung2006. Similarly, for the denominator,
[TABLE]
Thus, this yields
[TABLE]
We would like to show that
[TABLE]
It suffices to show that
[TABLE]
We will start by providing an upper bound on . From Lemma 1 of \citeAlaurent2000, it follows that, for a random variable,
[TABLE]
Therefore, it follows that, for any ,
[TABLE]
By choosing such that
[TABLE]
then
[TABLE]
Let be constants that will be chosen later. Define the event as
[TABLE]
Again, by Lemma 1 of \citeAlaurent2000,
[TABLE]
Now,
[TABLE]
From assumption (A), we have that . Let
[TABLE]
Then, assuming that
[TABLE]
it will follow that
[TABLE]
which finishes the proof by tuning and sufficiently large appropriately. ∎
Proof of Lemma 3.
Let denote the projection onto and the projection onto the orthogonal complement. Let as
[TABLE]
We will choose an arbitrary sequence of weakly sparse sets satisfying
[TABLE]
Fix arbitrarily and define the set as follows
[TABLE]
Looking at the squared norm, we will start with the convexity of the norm and the triangle inequality to obtain that
[TABLE]
We will consider each of the three terms separately. For the first term, it follows from the definition of that
[TABLE]
For the second term, we have that
[TABLE]
Fix temporarily. Then, we have that
[TABLE]
Therefore, by the generalized Hölder inequality,
[TABLE]
Computing each of the three Laplace transforms separately, since
[TABLE]
it follows that
[TABLE]
Similarly,
[TABLE]
For the other term, observe that
[TABLE]
where denotes stochastic ordering. Hence,
[TABLE]
Since and , it will follow that
[TABLE]
By assumption,
[TABLE]
for some constant . Hence,
[TABLE]
since . This implies that
[TABLE]
For the last term, fix a value of large to be determined later. Define the event as
[TABLE]
We will first provide an upper bound on . Indeed,
[TABLE]
Let and define . Fixing , we have by the Hanson-Wright inequality (Theorem 1.1 of \citeArudelson2013) that
[TABLE]
for some universal constant . Note that
[TABLE]
Since is a real symmetric matrix, it admits a Spectral Decomposition with maximal eigenvalue bounded by and rank at most . Hence, this gives
[TABLE]
Directly computing the expectation,
[TABLE]
This gives
[TABLE]
This implies that
[TABLE]
Now,
[TABLE]
Computing each term separately, we have, by the definition of , that
[TABLE]
For the other term, note that
[TABLE]
where the second inequality is due to Cauchy-Schwarz and the limit follows from equation (13), assuming that was chosen sufficiently large. Therefore, we have shown that
[TABLE]
Since was arbitrary, this proves the first claim. For the second claim, note that the proof is identical except for defining the set as
[TABLE]
This finishes the proof. ∎
7.2 Proofs for Section 3
Proof of Theorem 4.
Let . From the proof of Theorem 2, it is easy to see that
[TABLE]
Therefore, we immediately have that
[TABLE]
We will consider each random variable separately. Indeed, by the Central Limit Theorem,
[TABLE]
or, equivalently,
[TABLE]
For the other term, we will verify Lindeberg’s condition. Define
[TABLE]
Let be a sequence of i.i.d. random variables. Then, fixing ,
[TABLE]
where is the square root of the fourth central moment of a random variable. The second inequality is an application of Cauchy-Schwarz and the limit follows from assumption (B1). Therefore, by the Lindeberg Central Limit Theorem,
[TABLE]
Equivalently,
[TABLE]
By the independence of and and assumption (B2), the desired result now follows by differencing. ∎
Proof of Proposition 5.
Indeed, by the Law of Large Numbers,
[TABLE]
Hence, it is clear that . The fact that is assumption (B2). Finally, for , it follows from Theorem 4 that
[TABLE]
Now, define as
[TABLE]
It suffices to show that
[TABLE]
where is defined in the proof of Theorem 4. Then, we have that
[TABLE]
We will consider each of the two terms separately. Applying Cauchy-Schwarz to the first term, it follows that
[TABLE]
For the second term, note that
[TABLE]
which finishes the proof. ∎
7.3 Proofs for Section 4
Proof of Proposition 6.
Indeed, the consistency of is by the Law of Large Numbers since
[TABLE]
Let be a sequence of i.i.d. random variables. For , we have that
[TABLE]
Considering each term separately, we have, by the Law of Large Numbers and assumption (C1),
[TABLE]
For the second term, by a direct variance calculation,
[TABLE]
where the limit uses assumption (C1). Hence,
[TABLE]
Combining everything together shows that
[TABLE]
The proof for is analogous and is omitted, which finishes the proof. ∎
Proof of Lemma 7.
By independence, the joint distribution of is given by
[TABLE]
Therefore, the joint distribution of and is given by
[TABLE]
By standard results on the conditional mean of , it follows that
[TABLE]
which finishes the proof. ∎
Proof of Theorem 8.
From Proposition 6, it follows that
[TABLE]
where , , and are all . We will write . Define
[TABLE]
Note that is positive definite and is invertible almost surely. Then, by the Matrix Inversion Lemma,
[TABLE]
Some algebra will yield
[TABLE]
where we have added and subtracted and applied the Matrix Inversion Lemma. From Lemma 7, it follows that
[TABLE]
Define . Therefore,
[TABLE]
We will prove each of the two terms on the right hand side are . Before doing so, we will prove a few useful facts that will facilitate the remainder of the proof.
- (I)
. 2. (II)
and . 3. (III)
and .
(I) is immediate since is the sum of three positive semi-definite matrices, one of which being , and by assumption (C3). The first part of (II) follows from the fact that
[TABLE]
where denotes stochastic ordering and by assumption (C3). The second part of (II) is identical except we apply (I) to bound the maximal eigenvalue. Finally, for (III), the first part follows from and . For the second part, note that
[TABLE]
Then, . Therefore,
[TABLE]
This proves all three claims. We can now show that , which we will show in parts. To that end, note that
[TABLE]
In the last line, we have used (I) and (III) from above and Lemma 3. Similarly,
[TABLE]
The last line follows from the fact that and (I), (II), and (III) from above. For the last term, we will apply the Matrix Inversion Lemma again to obtain
[TABLE]
Hence,
[TABLE]
where (I), (II), and (III) have all been used from above. Combining these three results with the triangle inequality, this proves that
[TABLE]
For the other quantity, we have that
[TABLE]
Applying all three facts from above demonstrates that
[TABLE]
Using the Cauchy-Schwarz inequality yields
[TABLE]
which finishes the proof. ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bradic et al. (2017) Bradic, J., Claeskens, G., & Gueuning, T. (2017). Fixed effects testing in high-dimensional linear mixed models. ar Xiv preprint ar Xiv:1708.04887 .
- 2Brown et al. (2018) Brown, L. D., Mukherjee, G., & Weinstein, A. (2018). Empirical bayes estimates for a two-way cross-classified model. The Annals of Statistics , 46 (4), 1693–1720.
- 3Cai & Guo (2017) Cai, T. T. & Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of statistics , 45 (2), 615–646.
- 4Chen et al. (2015) Chen, F., Li, Z., Shi, L., & Zhu, L. (2015). Inference for mixed models of anova type with high-dimensional data. Journal of Multivariate Analysis , 133 , 382–401.
- 5Groll & Tutz (2014) Groll, A. & Tutz, G. (2014). Variable selection for generalized linear mixed models by l 1-penalized estimation. Statistics and Computing , 24 (2), 137–154.
- 6Jiang (2007) Jiang, J. (2007). Linear and generalized linear mixed models and their applications . Springer Science & Business Media.
- 7Laurent & Massart (2000) Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 1302–1338.
- 8Leung & Barron (2006) Leung, G. & Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory , 52 (8), 3396–3410.
