Statistical inference with F-statistics when fitting simple models to high-dimensional data
Hannes Leeb, Lukas Steinberger

TL;DR
This paper investigates the validity of F-tests in high-dimensional linear models where the number of predictors exceeds the number of observations, showing asymptotic correctness even under model misspecification.
Contribution
It provides theoretical results demonstrating the asymptotic validity of F-tests for simple linear models in high-dimensional settings, despite potential misspecification.
Findings
F-test remains valid asymptotically in high-dimensional regimes
Validity holds even when the simple model is misspecified
Results applicable to models with many more predictors than observations
Table 1: Mean absolute deviation of the simulated rejection probabilities from the nominal significance level, by full-model dimension $d$ (rows) and sub-model dimension $p$ (columns).

| $d$ | $p=1$ | $p=2$ | $p=5$ | $p=25$ | $p=1$ | $p=2$ | $p=5$ | $p=25$ |
|---|---|---|---|---|---|---|---|---|
| | | | | | Exp(1) | | | |
| 2 | 0.077 | | | | 0.141 | | | |
| 4 | 0.056 | 0.076 | | | 0.093 | 0.140 | | |
| 10 | 0.032 | 0.047 | 0.066 | | 0.052 | 0.071 | 0.109 | |
| 50 | 0.009 | 0.013 | 0.017 | 0.019 | 0.014 | 0.015 | 0.020 | 0.033 |
| 100 | 0.007 | 0.008 | 0.009 | 0.010 | 0.009 | 0.009 | 0.012 | 0.015 |
| 200 | 0.006 | 0.007 | 0.006 | 0.008 | 0.007 | 0.007 | 0.006 | 0.009 |
| | | | | | Unif | | | |
| 2 | 0.188 | | | | 0.025 | | | |
| 4 | 0.158 | 0.225 | | | 0.020 | 0.023 | | |
| 10 | 0.122 | 0.167 | 0.238 | | 0.011 | 0.014 | 0.016 | |
| 50 | 0.062 | 0.084 | 0.116 | 0.123 | 0.006 | 0.006 | 0.007 | 0.007 |
| 100 | 0.048 | 0.061 | 0.081 | 0.082 | 0.005 | 0.006 | 0.006 | 0.005 |
| 200 | 0.033 | 0.044 | 0.057 | 0.055 | 0.005 | 0.005 | 0.005 | 0.006 |
| | | | | | Gauss | | | |
| 2 | 0.335 | | | | 0.005 | | | |
| 4 | 0.332 | 0.458 | | | 0.006 | 0.005 | | |
| 10 | 0.301 | 0.411 | 0.563 | | 0.005 | 0.005 | 0.006 | |
| 50 | 0.250 | 0.335 | 0.456 | 0.518 | 0.005 | 0.006 | 0.005 | 0.005 |
| 100 | 0.228 | 0.314 | 0.412 | 0.457 | 0.005 | 0.005 | 0.006 | 0.005 |
| 200 | 0.212 | 0.286 | 0.383 | 0.407 | 0.005 | 0.005 | 0.006 | 0.006 |
Statistical inference with F-statistics
when fitting simple models to high-dimensional data
Hannes Leeb (University of Vienna and DataScience@UniVienna)
Lukas Steinberger (University of Freiburg)
Abstract
We study linear subset regression in the context of the high-dimensional overall model $y = \vartheta + \theta' z + \epsilon$ with univariate response $y$ and a $d$-vector of random regressors $z$, independent of $\epsilon$. Here, ‘high-dimensional’ means that the number $d$ of available explanatory variables is much larger than the number $n$ of observations. We consider simple linear sub-models where $y$ is regressed on a set of $p$ regressors given by $x = M'z$, for some $d \times p$ matrix $M$ of full rank $p < n$. The corresponding simple model, i.e., $y = \alpha + \beta' x + e$, can be justified by imposing appropriate restrictions on the unknown parameter $\theta$ in the overall model; otherwise, this simple model can be grossly misspecified. In this paper, we establish asymptotic validity of the standard $F$-test on the surrogate parameter $\beta$, in an appropriate sense, even when the simple model is misspecified.
1 Introduction
The $F$-test is a staple tool of applied statistical analyses. It is widely used, sometimes also in situations where its applicability is debatable because underlying assumptions may not be met. We study a situation of this kind: an $F$-test after fitting a (possibly misspecified) working model. We focus, in particular, on a scenario where the fitted model has $p$ explanatory variables while the true model has $d$ explanatory variables, with $p \ll d$, and where the sample size $n$ is of the same order as $p$. Scenarios like this occur, for example, in quality control studies like Souders and Stenbakken (1991), where a model with 18 explanatory variables (out of a total of about 8,000) is fit based on a sample of size 50; in time series forecasting with principal components as in Stock and Watson (2002), who extract a handful of factors from 149 explanatory variables based on 480 monthly observations; or in genetic analyses like van’t Veer et al. (2002), who select and fit a model with 70 genes (out of a total of about 25,000) based on a sample of size 78. In situations like these, the question of whether the fitted model has any explanatory value is of particular interest. We show that, approximately, the usual $F$-statistic is $F$-distributed under a corresponding null hypothesis, and that it is non-central $F$-distributed in a local neighborhood of the null. Approximation errors go to zero as $n \to \infty$ if $p/\log d \to 0$ and if, at the same time, $p$ is of the same, or of slower, order as $n$; cf. Theorem 4.1 and Remark 4.3, respectively. Our results are uniform over a large region of the parameter space that we consider. In particular, our results also cover situations where the fitted model is misspecified. The setting of our analysis is non-standard in that we require a particular constellation of $n$, $p$ and $d$. This is a challenging setting of practical relevance, for which few theoretical results are available so far.
Our findings, which are given for independent observations, also prompt the question whether similar results can be obtained under serial correlation.
The $F$-statistic is exactly $F$-distributed in a correctly specified linear model with Gaussian errors; and it is asymptotically $F$-distributed under the strong Gauß-Markov condition on the errors if $n \to \infty$ while the model dimension stays fixed; cf. Anderson (1958). $F$-tests in correctly specified models in settings where $p$ is allowed to increase with $n$ are studied, among others, by Akritas and Arnold (2000); Bathke and Lankowski (2005); Boos and Brownie (1995); Harrar and Bathke (2008); Portnoy (1984, 1985); Wang and Cui (2013). In addition, there are several viable alternatives to the $F$-test in potentially misspecified settings; see, for example, Chen and Qin (2010); Eicker (1967); Huber (1967); White (1980a, b); Zhong and Chen (2011). For further results on hypothesis testing and marginal screening in misspecified models, see, for example, Boos and Stefanski (2013); Choi and Kiefer (2011); Fomby and Hill (2003); Jensen and Ramirez (1991); Ramirez and Jensen (1991), and the references therein.
On a technical level, this paper relies on Wang and Cui (2013), the corresponding extensions and corrections in Steinberger (2016), and also on Steinberger and Leeb (2018a, b); all but the first of these references are based on Steinberger (2015).
The rest of the paper is structured as follows: In Section 2, we describe the true data-generating model and the underlying parameter space. The (typically misspecified) working model and the corresponding $F$-statistic are described in Section 3. Our main theoretical result is given in Section 4, and a simulation study in Section 5 demonstrates that our asymptotic approximations can ‘kick in’ reasonably fast.
2 The true model
Throughout, we consider the (true) linear model
[TABLE]
with and for some . We assume that the error is independent of , with mean zero and finite variance ; its distribution will be denoted by . Moreover, we assume that the vector of regressors has mean and positive definite variance/covariance matrix . Our model assumptions are further discussed in Steinberger and Leeb (2018a, Remark 7.1). No additional restrictions will be placed on the regression coefficients and , on the moments and , or on the error distribution .
We do place some assumptions on the distribution of the explanatory variables. First, we assume that can be written as an affine transformation of independent random variables. With this, we can represent the -vector as
[TABLE]
for a -vector with independent (but not necessarily identically distributed) components so that and , where is the positive definite and symmetric square root of , and where is an orthogonal (non-random) matrix. Second, we assume that has a Lebesgue density, which we denote by , with bounded marginal densities and finite marginal moments of sufficiently high order. In particular, we will assume that belongs to one of the classes that are defined in the next paragraph, for appropriate constants , and . Our assumptions on are similar to those maintained by Bai and Saranadasa (1996) and Zhong and Chen (2011). For later use, note that the distribution of in (1)–(2) is characterized by and , by , by and , by , and by .
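The design representation described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes the representation has the form $z = \mu + \Sigma^{1/2} R \tilde{z}$, and the names `sym_sqrt`, `draw_design`, `mu`, `Sigma`, `R` and `z_tilde` are chosen here for illustration. The standardized components are taken to be centered Exp(1) variables, one of the designs used later in the simulations.

```python
import numpy as np
from scipy.stats import ortho_group


def sym_sqrt(Sigma):
    """Symmetric positive definite square root of Sigma via eigendecomposition."""
    w, V = np.linalg.eigh(Sigma)
    return (V * np.sqrt(w)) @ V.T


def draw_design(n, mu, Sigma, R, rng):
    """Draw n rows of z = mu + Sigma^{1/2} R z_tilde (assumed form of the representation)."""
    d = len(mu)
    # Centered Exp(1) components: independent, mean 0, variance 1.
    z_tilde = rng.exponential(1.0, size=(n, d)) - 1.0
    # Row i of the result is mu + (Sigma^{1/2} R) z_tilde_i.
    return mu + z_tilde @ (sym_sqrt(Sigma) @ R).T


rng = np.random.default_rng(0)
d = 4
Sigma = np.diag([4.0, 1.0, 1.0, 1.0])        # illustrative covariance matrix
R = ortho_group.rvs(dim=d, random_state=0)   # Haar-distributed orthogonal matrix
Z = draw_design(100_000, np.zeros(d), Sigma, R, rng)
```

Because $R$ is orthogonal, the covariance of the simulated regressors is $\Sigma^{1/2} R R' \Sigma^{1/2} = \Sigma$ regardless of which $R$ is drawn; only the higher-order dependence structure changes with $R$.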
Fix an integer and positive (finite) constants and . With this, write for the class of Lebesgue densities on that are products of univariate marginal densities such that each such marginal density is bounded from above by , and such that each univariate marginal density has absolute moments of order up to that are bounded by .
3 The sub-model and the $F$-test
Consider a sub-model where is regressed on , with given by
[TABLE]
for some full-rank matrix with . For example, can be a selection matrix that picks out components of the -vector . Submodels with regressors of the form also occur in principal component regression, partial least squares, and certain sufficient dimension reduction methods. We are particularly interested in situations where is much larger than , i.e., . Trivially, we can write
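As a small illustration of the selection-matrix example, the following sketch builds a matrix whose columns are standard basis vectors and applies it to one regressor vector. The function name `selection_matrix` and the symbols `M`, `z`, `x` are assumptions made here for illustration, with the sub-model regressors taken to be `M.T @ z`.

```python
import numpy as np


def selection_matrix(d, idx):
    """Return the d x p matrix whose j-th column is the standard basis vector e_{idx[j]}."""
    M = np.zeros((d, len(idx)))
    M[idx, np.arange(len(idx))] = 1.0
    return M


d, idx = 6, [0, 2, 5]
M = selection_matrix(d, idx)
z = np.arange(1.0, d + 1.0)   # one illustrative regressor vector (1, 2, ..., 6)
x = M.T @ z                   # keeps the components of z with indices 0, 2 and 5
```

Other choices of the matrix, such as the leading principal component loadings, fit the same pattern; only the columns change.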
[TABLE]
with , where and minimize . The ‘error’ has mean zero (because both (1) and (4) include an intercept), and we denote its variance by . Note that and, for later use, that
[TABLE]
Irrespective of whether the working model is correctly specified, the ‘surrogate’ parameters of the working model are always well-defined. Here, $\beta$ is our main object of interest, instead of the underlying true parameter $\theta$. Such surrogate parameters are well-known in the statistics literature, certainly since Huber (1967), and have recently gained new popularity, as witnessed by, e.g., Abadie et al. (2014); Brannath and Scharpenberg (2014); Bachoc et al. (2015); Buja et al. (2014). In particular, such surrogate parameters can be consistently estimated, in a standard $M$-estimation setting, by the OLS estimator or by robust alternatives, provided that $p$ is not too large relative to $n$ (see Portnoy, 1984, 1985; White, 1980a, b); cf. also Lemma A.3 in Steinberger (2015) and Lemma A.4 in Steinberger and Leeb (2018a) for analyses tailored to our present setting.
The working model (4) is correct (in the usual sense) if , i.e., if or, equivalently, if . This is the case if lies in the column space of ; if is a selection matrix, this means that selects all the non-zero components of . Here, we do not assume that the working model is correct. In particular, we stress that may differ from , and that may depend on .
When working with the simple sub-model (4), a natural question is whether $x$ has any explanatory value for the response variable $y$. Given a sample of $n$ independent and identically distributed (i.i.d.) observations of $y$ and $x$ from (4), a classical approach to this question is to use the $F$-test of the hypotheses
[TABLE]
Let and denote the vector of responses and the matrix of explanatory variables, respectively. Write for the OLS-estimator for when is regressed on and a constant, set , and write for the usual -statistics for testing , i.e., if the numerator is well-defined and the denominator is positive and otherwise. Here, denotes the orthogonal projection on the space spanned by the column-vectors indicated in the subscript and denotes the -vector . Note that with probability one by our assumptions.
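The classical $F$-statistic for testing that all slope coefficients vanish can be computed as in the following sketch. It uses the textbook form based on residual sums of squares with and without the regressors (both fits include an intercept), which is assumed here to agree with the projection-based definition in the text. The function name `f_statistic` is hypothetical, and the value returned in the degenerate case is a placeholder chosen for this sketch.

```python
import numpy as np
from scipy import stats


def f_statistic(y, X):
    """F-statistic for H0: all slopes are zero, in a regression of y on X and a constant."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])            # design with intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)    # OLS fit
    rss_full = np.sum((y - X1 @ coef) ** 2)          # residual sum of squares, full fit
    rss_null = np.sum((y - y.mean()) ** 2)           # residual sum of squares, intercept only
    df2 = n - p - 1
    if df2 <= 0 or rss_full <= 0.0:
        return float("nan")                          # placeholder for the degenerate case
    return ((rss_null - rss_full) / p) / (rss_full / df2)


rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                           # null hypothesis holds here
F = f_statistic(y, X)
p_value = stats.f.sf(F, p, n - p - 1)                # reference distribution F(p, n - p - 1)
```

For a single regressor, the statistic reduces to the square of the usual $t$-statistic of the slope, which gives a quick sanity check of the implementation.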
may be re-phrased as the hypothesis that the best linear predictor of given is constant. An alternative to is the hypothesis that the Bayes-estimator of given is constant, i.e.,
[TABLE]
Testing this non-parametric hypothesis is more difficult. In the asymptotic setting that we consider in the next section, however, we find that and are close to each other in the sense that the Bayes predictor and the best linear predictor (of given ) are close in terms of mean-squared prediction error; see Remark 4.2 for details.
4 Main result
Our main result is concerned with the asymptotic distribution of the $F$-statistic in a local neighborhood of the null hypothesis. Here, the local neighborhood is defined through the requirement that
[TABLE]
is small. This quantity can be interpreted as a signal-to-noise ratio in (4) and depends on , , and ; cf. (5). If the error in (4) is Gaussian and independent of , then the -statistic is -distributed with parameters , and non-centrality parameter ; in that case, we have , where denotes the cumulative distribution function (c.d.f.) of the -distribution with indicated parameters. In our present setting, however, the error in (4) need not be Gaussian and can depend on .
We will show that the distribution of can be approximated by an -distribution, uniformly over most parameters in the model. Only for , and , i.e., for the error in (1) and for the density of the standardized explanatory variables as well as the orthogonal matrix in (2), some restrictions are needed. We will require a moment restriction on , and we will require that belongs to one of the classes introduced earlier. To formulate the restriction on , write for the collection of all orthogonal matrices and write for the uniform distribution on that set; i.e., is the normalized Haar measure on the -dimensional orthogonal group. For , we will require that it belongs to a Borel set that is large in terms of .
Theorem 4.1.
Fix finite constants and , and positive finite constants , , and . For each full-rank matrix , each variance/covariance matrix and each there exists a Borel set so that
[TABLE]
and so that the following holds: If denotes either the quantity
[TABLE]
or the quantity
[TABLE]
for some fixed , then
[TABLE]
This statement continues to hold if the restriction in the last display is replaced by provided that . [Here, the suprema are taken over all full-rank matrices , all , all -vectors and , all distributions so that has mean zero and finite positive variance, and all symmetric and positive definite matrices , subject to the indicated restrictions.]
Remark 4.2.
Write and for the prediction risk of the Bayes predictor and of the best linear predictor, respectively, of given . That is, and . The results of Steinberger and Leeb (2018a) then entail that, in the setting of Theorem 4.1, converges to one, uniformly over all the parameters indicated in the last display of that theorem. In fact, the risk-ratio converges to one uniformly even if the restriction on is removed altogether, and a similar statement holds for the ratio of conditional risks given , i.e., . See Theorem 3.1 of Steinberger and Leeb (2018a) for a more general form of this statement under weaker assumptions.
Remark 4.3.
Although the asymptotic approximations in Theorem 4.1 require that $p$ is of the same order as $n$, we point out that the non-central $F$-distribution should still give a reasonable approximation to the distribution of the $F$-statistic, i.e., the expression in (7) should be small, even if $p/n$ is very small, and, in particular, if $p$ is fixed while $n$ increases. This situation is further discussed in Steinberger (2015, p. 31, Section 3.2.2) for fixed $p$. Clearly, the same is not true for the expression in (8), because the normal approximation to the $F$-distribution is valid only if both degrees of freedom, i.e., $p$ and $n-p-1$, are large. The statement regarding (8) in Theorem 4.1 coincides with the conclusion of Theorem 1 in Zhong and Chen (2011) obtained for the correctly specified Gaussian error case. Moreover, the Gaussian approximation in (8) has the advantage that it is easier to interpret than the more complicated distribution function of the non-central $F$-distribution in (7); see also the discussion in Steinberger (2016, Remark 2.4).
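The point about degrees of freedom can be checked numerically. The sketch below compares the c.d.f. of a central $F$-distribution with that of a normal distribution with the same mean and variance. This moment-matched normal is a stand-in chosen here for illustration and is not claimed to be the exact centering and scaling used in (8); the function name `max_cdf_gap` is hypothetical.

```python
import numpy as np
from scipy import stats


def max_cdf_gap(df1, df2, grid_size=2001):
    """Largest gap on a grid between the F(df1, df2) c.d.f. and a moment-matched normal c.d.f."""
    mean = stats.f.mean(df1, df2)   # requires df2 > 2
    sd = stats.f.std(df1, df2)      # requires df2 > 4
    x = np.linspace(max(mean - 5.0 * sd, 0.0), mean + 5.0 * sd, grid_size)
    return np.max(np.abs(stats.f.cdf(x, df1, df2) - stats.norm.cdf(x, mean, sd)))


gap_both_large = max_cdf_gap(200, 300)   # both degrees of freedom large
gap_first_small = max_cdf_gap(1, 300)    # first degree of freedom small
```

With both degrees of freedom large the gap is small, whereas with a single numerator degree of freedom the $F$-distribution is strongly skewed and the normal approximation fails visibly.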
5 Simulation analysis
Theorem 4.1 is an asymptotic result. In this section, we study a range of non-asymptotic scenarios through simulation to investigate how soon these asymptotic approximations become accurate. We consider a rather small sample size of and look at different configurations of the model dimensions and with , and also at different points in parameter space.
The theorem contains two asymptotic statements, one about the distribution of the $F$-statistic and one about the size of the set . For the distribution of the $F$-statistic, we compare the rejection probability of the $F$-test under the null hypothesis with the nominal significance level . The nominal significance level provides a natural benchmark. [Clearly, one can also investigate the power of the $F$-test through simulation experiments, but, unlike the significance level, it is less obvious what the right benchmark for the power should be.] In particular, we simulate 1000 independent realizations , of the $F$-statistic at sample size under the null for each point in parameter space (the index will be explained shortly), and compare the empirical significance level with the nominal level .
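A Monte Carlo estimate of the rejection probability under the null can be sketched as follows. This is a simplified illustration, not the authors' simulation code: the sample size 50, the nominal level 0.05, the identity covariance matrix and all function names are assumptions made here. The working model uses the first `p` regressors, there is no error term in the true model, and the coefficient vector is zero in its first `p` entries so that the null hypothesis holds when the covariance is the identity.

```python
import numpy as np
from scipy import stats
from scipy.stats import ortho_group


def f_stat(y, X):
    """Classical F-statistic for H0: all slopes are zero (intercept included)."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    resid = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    rss_full = resid @ resid
    rss_null = np.sum((y - y.mean()) ** 2)
    return ((rss_null - rss_full) / p) / (rss_full / (n - p - 1))


def gauss(rng, size):
    return rng.standard_normal(size)


def centered_exp(rng, size):
    return rng.exponential(1.0, size) - 1.0


def rejection_rate(draw, n, d, p, R, theta, level, reps, rng):
    """Fraction of replications in which the F-test rejects at the given level."""
    crit = stats.f.ppf(1.0 - level, p, n - p - 1)
    hits = 0
    for _ in range(reps):
        Z = draw(rng, (n, d)) @ R.T      # rows z_i = R z_tilde_i (identity covariance assumed)
        y = Z @ theta                    # no error term in the true model
        hits += int(f_stat(y, Z[:, :p]) > crit)
    return hits / reps


rng = np.random.default_rng(1)
n, d, p, level = 50, 50, 5, 0.05         # illustrative values, not taken from the paper
R = ortho_group.rvs(dim=d, random_state=1)
theta = np.concatenate([np.zeros(p), rng.standard_normal(d - p)])
rate_gauss = rejection_rate(gauss, n, d, p, R, theta, level, 1000, rng)
rate_exp = rejection_rate(centered_exp, n, d, p, R, theta, level, 1000, rng)
```

In the Gaussian case the response is independent of the selected regressors, so the test is exact and `rate_gauss` differs from the nominal level only by simulation error. In the exponential case the response is uncorrelated with, but not independent of, the selected regressors, so `rate_exp` reflects the effect of misspecification for this particular orthogonal matrix.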
Gauging the size of is more difficult, because that set is not given explicitly. We proceed as follows: We fix all the parameters in (1)–(2) except for the orthogonal matrix in (2). We then simulate 100 independent realizations of , compute as outlined above, , and finally compute . If , then should be close to , in view of the last display in Theorem 4.1. We use and the empirical distribution of the , , as indicators for the size of .
The remaining parameters in (1)–(2) and the submodel matrix are chosen as follows for any fixed values of and : We do not include an error term in the true model, i.e., we set , because the effect of misspecification becomes more pronounced when the error variance is small. [Footnote 1: Note that if the error variance in the true model is overly large, i.e., much larger than , then the scaled true model is essentially given by . Since the $F$-statistic is scale-invariant and is independent of , we then have . In that case, the $F$-statistic will essentially follow the null-distribution and we expect a rejection probability close to the nominal level, irrespective of and .]
[Note that the case where is not covered by Theorem 4.1, but inspection of the proof shows that our results also apply in this case; cf. Remark A.3.] For , we consider product distributions with zero mean and i.i.d. components from the Student-$t$ distribution with , and degrees of freedom, as well as from the centered exponential, uniform, Bernoulli and Gaussian distributions. [Note that the scaling of these distributions is inconsequential, because of the scale-invariance of the $F$-statistic in both arguments and the fact that we do not include an error term in the full model, i.e., scaling of is equivalent to scaling of both and . Similarly, also the scaling of and has no impact on the value of the $F$-statistic.] For , we chose a spiked covariance matrix with eigenvalues and and an orthogonal matrix of eigenvectors chosen randomly from the uniform distribution on the orthogonal group. [Footnote 2: The spiked covariance model corresponds to a factor model where the identity matrix is perturbed by a low rank matrix. It has received much attention in the literature on high dimensional random matrices (e.g., Baik and Silverstein, 2006; Cai et al., 2013; Donoho et al., 2013; Johnstone, 2001).] We have repeated the simulations also with covariance matrices of an AR process and obtained essentially the same results.
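A spiked covariance matrix with uniformly distributed eigenvectors can be generated as in the following sketch. The spike size, the number of spikes and the function name `spiked_covariance` are illustrative assumptions; the eigenvalues used in the paper's simulations are not reproduced here.

```python
import numpy as np
from scipy.stats import ortho_group


def spiked_covariance(d, spike, n_spikes=1, seed=0):
    """Covariance with n_spikes eigenvalues equal to `spike`, the rest equal to 1."""
    eigvals = np.ones(d)
    eigvals[:n_spikes] = spike
    # Eigenvectors drawn from the uniform (Haar) distribution on the orthogonal group.
    U = ortho_group.rvs(dim=d, random_state=seed)
    return (U * eigvals) @ U.T       # equals U diag(eigvals) U'


Sigma = spiked_covariance(d=10, spike=25.0)
```

The result is the identity matrix plus a low-rank perturbation, which matches the factor-model reading of the spiked model.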
The intercept terms and are set to zero, for convenience. For the matrix , which describes the working model, we take equal to the matrix whose -th column is the -th standard basis vector in , . In other words, we consider a sub-model that includes only the first regressors (out of ). For the parameter , we need to ensure that the null hypothesis is satisfied, i.e., that . By construction of , is regular, and we choose , for one realization of , to guarantee that .
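One way to enforce the null hypothesis for the surrogate parameter is sketched below. It uses the standard population least-squares formula $\beta = (M'\Sigma M)^{-1} M'\Sigma\theta$ and projects an arbitrary vector so that $M'\Sigma\theta = 0$. The specific projection, the function name `null_theta` and the variable `v` are assumptions made for illustration and are not claimed to be the exact construction used in the paper.

```python
import numpy as np


def null_theta(M, Sigma, v):
    """Return theta = (I - M (M' Sigma M)^{-1} M' Sigma) v, which satisfies M' Sigma theta = 0."""
    G = M.T @ Sigma                                  # p x d matrix M' Sigma
    return v - M @ np.linalg.solve(G @ M, G @ v)


rng = np.random.default_rng(0)
d, p = 8, 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)                      # some positive definite covariance
M = np.eye(d)[:, :p]                                 # working model: first p regressors
theta = null_theta(M, Sigma, rng.standard_normal(d))

# Population regression coefficient of y on x = M'z; zero under the null.
beta = np.linalg.solve(M.T @ Sigma @ M, M.T @ Sigma @ theta)
```

Note that `theta` itself is generally non-zero in its first `p` entries unless the covariance is diagonal, so the working model is typically misspecified even though its surrogate slope vanishes.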
The results of the simulations are summarized in Table 1 and Figures 1 and 2. From Table 1, the overall picture we get is consistent with what was predicted by our theory. For all distributions except the Gaussian, the average absolute difference between the true (simulated) rejection probabilities and the nominal level decreases as $d$ increases. This phenomenon is most pronounced for the exponential distribution, which has a finite moment generating function around the origin, and is weakest for the heavy-tailed $t$-distribution that does not even have finite variance. For uniformly distributed design, which is bounded, the effect of misspecification on the size of the $F$-test is relatively mild already for small dimensions. In the Gaussian case, all sub-models of the form (4) are correct in the sense that the error is Gaussian with mean zero and independent of $x$, so that theoretically the corresponding panel in Table 1 should contain only zeroes. The numbers therefore represent only the simulation error and serve as a benchmark for the other panels. We also see a monotonic increase in the deviation of the size of the $F$-test from the nominal level as the dimension $p$ of the sub-model increases, which was also suggested by our theory. However, if we fix the ratio $p/d$, i.e., if we move along the staircase pattern in each of the panels, then, except for the heavy-tailed $t$-distributions, we still see the effect of misspecification decrease as $d$ increases. This suggests that convergence of $p/\log d$ to zero, as required in Theorem 4.1, may not be necessary, at least in the scenarios considered here.
In Table 1, the effect of the orthogonal matrix on the actual significance level of the $F$-test was compressed into one summary statistic, namely the mean absolute deviation from the nominal significance level. To get a more comprehensive picture, Figures 1 and 2 show plots of the sample (gray crosses) and superimposed box-plots for different design distributions. Due to limited space we present only the results for sub-models of dimension . In view of Theorem 4.1, we expect that the size , i.e., the family of matrices for which (7) and (8) get small, grows with . Consequently, we expect that many of the should be close to . On the other hand, if is not large then many matrices will lead to a biased rejection probability due to misspecification of the working model. This is exactly what we observe in Figures 1 and 2. For small values of , the rejection probabilities are systematically biased and we see some variability of their values due to the variation in the choice of (compare benchmark panel in Figure 2). Both the bias and the variability in reduce when increases, which is what we expected, as for large , most will be favorable and we obtain small misspecification errors uniformly over these favorable . What is remarkable is the systematic over-rejection in the case of the $t$- and exponential distributions and the under-rejection for Bernoulli and uniformly distributed designs. We currently cannot explain the mechanism that is responsible for this pattern. Finally, the benchmark panel shows i.i.d. samples with and success probabilities . This provides some idea of what portion of the variability observed in the other panels is due to random simulation error. Clearly, the results in the benchmark panel could have been equivalently obtained by repeating the previous simulation for the $F$-test with Gaussian design at significance levels .
Acknowledgments
The first author’s research was partially supported by FWF projects P 26354-N26 and P 28233-N32.
Appendix A Proofs
We begin with some preliminary considerations that connect this paper with the results of Steinberger and Leeb (2018b). In particular, we use Theorem 2.1, parts (ii) and (iii), in that reference with and : If , then the assumptions of that result are satisfied in view of Example 3.1 in Steinberger and Leeb (2018b). The theorem guarantees existence of a Borel subset of the Stiefel manifold of order , that depends on the density , such that for all both
[TABLE]
and
[TABLE]
are bounded from above by
[TABLE]
such that
[TABLE]
where denotes the uniform distribution on the Stiefel manifold, and such that the set is right-invariant under the action of , i.e., whenever . Here, the constant depends only on , and the constant depends only on .
For any full rank matrix , any symmetric positive definite matrix and , we define the set
[TABLE]
Now take a random matrix that is uniformly distributed on and another random matrix that is uniformly distributed on , such that and are independent, and note that by right-invariance of ,
[TABLE]
because and is characterized by left and right invariance under the appropriate orthogonal groups. It follows that is bounded by the expression on the right-hand side of (11) whenever , which establishes the first claim of Theorem 4.1. The proof of the second claim is more elaborate.
The results in the preceding paragraph also show that the error in the working model (4) is such that is approximately zero and is approximately constant, provided that : We first re-write the error in a convenient form. Set and . Then it is easy to see that and hence
[TABLE]
see also (4)–(5). Our goal is to show that the expressions in the preceding two displays are approximately zero. To this end, we focus on the expressions in curly brackets and use Cauchy-Schwarz: For each , we have
[TABLE]
Now if , then it is easy to see that . Because conditioning on is equivalent to conditioning on , it follows that is bounded from above by (9) with replaced by and that is bounded by (9) with replaced by .
The consideration in the preceding paragraph suggests that the effect of misspecification in (4), where may be non-zero and may be non-constant, may be negligible in an asymptotic setting where becomes small, provided that and that . This idea is formalized in the following two results, which show that the distribution of certain statistics is unaffected asymptotically if the error is replaced by a substitute error that has mean zero and constant variance conditional on . The following results are stated for sequences where the data-generating model (1)-(2) and the working model (4) are allowed to depend on , that is, a ‘triangular array’ setting where all parameters depend on .
Lemma A.1.
Fix finite positive constants and . For every , let be positive integers so that as . For each , consider as in (1)–(3) but with and replacing and , respectively, with and with . And for each , consider a sample of i.i.d. observations , , of , stack the values of the individual variables into a vector and matrices and , respectively, and write for the vector of errors from (4). Finally, define a vector of substitute errors through . Then, for every and (possibly random) symmetric idempotent matrices ,
[TABLE]
as . As a by-product, we also obtain that
[TABLE]
Proof.
First, note that , so that is well defined (almost surely). For the claim in (13), fix and , and consider . Now, using the simple observation , we get
[TABLE]
and furthermore
[TABLE]
The claim (13) will follow if each of the four terms in (LABEL:tmp2) is of the order . Because and , the considerations leading up to Lemma A.1 apply. Also note that . For the last term in (LABEL:tmp2), we obtain, for every , that
[TABLE]
and the upper bound goes to zero as in view of the assumption that . For the second-to-last term in (LABEL:tmp2), we have . For the second term in (LABEL:tmp2), we proceed like for the last term in (LABEL:tmp2). In particular, we obtain, for any , that
[TABLE]
Again, this upper bound goes to zero as because . Note that the considerations in the preceding display also entail that .
For the claim in (14), write
[TABLE]
and note that by definition of and the variance decomposition formula, we have and , so that by independence . Premultiplying by in the previous display and applying (13) finishes the proof of the second claim. ∎
Lemma A.2.
Fix and an integer . Under the assumptions and in the notation of Lemma A.1, assume that for each , that and that . Define substitute data . Then, for every , we have
[TABLE]
as .
Proof.
The idea is to use Lemma A.1 to approximate by . In particular, we will show that on some event to be defined below, we have
[TABLE]
where converges to one and converges to zero, both at an arbitrary polynomial rate in , and where . The probability of will be shown to converge to one. The claim of the lemma follows from this.
Set , where . With this, define the event . On , by block matrix inversion, we have . Using the abbreviation , we thus see that and that the -statistic can be written as
[TABLE]
This establishes a representation on . On the complement of , we set , say. We next show that for every fixed , and .
To verify the claimed properties of these quantities, on , consider first
[TABLE]
Using Lemma A.1, we see that the first fraction in this representation multiplied by converges to zero in probability. The second fraction obviously equals . Define like (see the discussion following (6)) but with replacing . We show that in probability. To see this, first note that the convergence to zero of follows again from Lemma A.1. For the ratio , convergence to in probability follows, e.g., from Lemma C.1 in Steinberger (2016), upon verifying its assumptions. To this end, it remains to show that . Using , for , we have
[TABLE]
The maximum in the preceding display converges to one in probability if converges to one in probability, which follows from Lemma A.1. The arithmetic mean of the conditional fourth moments is if the unconditional mean of fourth moments is bounded in . To this end, note that we have and ; cf. (5) and the discussion right before (12). With this, we get
[TABLE]
and take expectations. The claim follows now from and the fact that the fourth spherical moment of is uniformly bounded in view of Rosenthal’s inequality (Rosenthal, 1970, Theorem 3) and the assumption that . Note that this also entails .
To see that also behaves as desired, first note that on ,
[TABLE]
The factor can be bounded by for some constant by assumption; the ratio was shown to converge to one in probability in the preceding paragraph. The difference of quadratic forms converges to zero in probability by Lemma A.1, even when multiplied by . Noting that , the scaled second term in parentheses, i.e., , can be bounded by
[TABLE]
where converges to zero in probability by Lemma A.1 and by assumption. It remains to show that the largest singular value of is bounded in probability. Due to the projection onto the orthogonal complement of , the distribution of this quantity does not depend on the parameter , which is why we may assume that for this part of the argument. Abbreviate , and consider . Taking expectation, noting that and , we arrive at the desired boundedness in probability.
It remains to show that . To this end, recall that in probability, and one easily verifies that
[TABLE]
here, the first equality is obtained by arguing as in the first paragraph of the proof but with replacing , and the second equality follows upon noting that and that is a vector with i.i.d. components, each of which has variance . ∎
Proof of Theorem 4.1.
Define as in the beginning of the appendix and note that the first statement in the theorem, concerning , has already been established there. For the second statement, concerning , let be positive integers so that and so that as . For each , consider a sample of i.i.d. observations , , as in Lemma A.1, so that the underlying quantities (i.e., , , , , , , , , and ) satisfy the restrictions in the suprema in the last display of Theorem 4.1. For given , we stress that the restriction on implicitly also restricts the parameters , and ; see the definition of at the beginning of Section 4 as well as the relations in (5). We have to show that as .
Set and for each , and define for each as in Lemma A.2. We first show that
[TABLE]
by verifying the assumptions of Theorem 2.1(i) in Steinberger (2016) for the sample , with the symbols , and in that reference equal to , , and , respectively. In particular, we need to verify conditions (A1).(a,b,c,d) and (A2) in that reference. The design conditions (A1).(a,c,d) are easily verified by use of Lemma A.2(i) in Steinberger (2016). And our assumptions that and that imply condition (A1).(b). Assumption (A2) on the scaled errors is established by an argument similar to the one also used in the third paragraph of the proof of Lemma A.2 but for the -th moment instead of the fourth moment: Simply decompose , with and , and use Lemma A.1 as before to get in probability. Then, the assumption that and the fact that the marginals of have bounded 20th moment, together with Rosenthal’s inequality establish the boundedness of , which is sufficient for (A2). Using Lemma A.2 and noting that for some , it follows that (18) continues to hold with replacing .
Now standard arguments conclude the proof: First, note that an appropriately scaled and centered -distributed random variable with and degrees of freedom and non-centrality parameter is also asymptotically normal, i.e.,
[TABLE]
because implies that . Hence, we have
[TABLE]
and the last two suprema converge to zero in view of Polya’s theorem, which establishes the in case equals (7). Finally, it is elementary to verify that also converges to zero in case equals (8): This follows from (19) with replacing , because the quantiles of the central -distribution satisfy . ∎
Remark A.3.
Inspection of the proof reveals that the assumption that is positive is used only to guarantee that almost surely (and hence also ). If this assumption is dropped, we thus see that (defined in Theorem 4.1) converges to zero along sequences of parameters as used in the proof of Theorem 4.1, provided that almost surely for each (as then a.s.).
References
- Abadie et al. (2014) A. Abadie, G. W. Imbens, and F. Zheng. Inference for misspecified models with fixed regressors. J. Amer. Statist. Assoc., 109:1601–1614, 2014.
- Akritas and Arnold (2000) M. Akritas and S. Arnold. Asymptotics for analysis of variance when the number of levels is large. J. Amer. Statist. Assoc., 95:212–226, 2000.
- Anderson (1958) T. W. Anderson. An introduction to multivariate statistical analysis. Wiley, New York, NY, 1958.
- Bachoc et al. (2015) F. Bachoc, H. Leeb, and B. M. Pötscher. Valid confidence intervals for post-model-selection prediction. arXiv:1412.4605, 2015.
- Bai and Saranadasa (1996) Z. Bai and H. Saranadasa. Effect of high dimension: By an example of a two sample problem. Stat. Sinica, 6:311–329, 1996.
- Baik and Silverstein (2006) J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivar. Anal., 97:1382–1408, 2006.
- Bathke and Lankowski (2005) A. Bathke and D. Lankowski. Rank procedures for a large number of treatments. J. Statist. Plann. Inference, 133:223–238, 2005.
- Boos and Brownie (1995) D. D. Boos and C. Brownie. ANOVA and rank tests when the number of treatments is large. Statist. Probab. Lett., 23:183–191, 1995.
