Statistical inference with F-statistics when fitting simple models to high-dimensional data

Hannes Leeb; Lukas Steinberger

arXiv:1902.04304·math.ST·February 13, 2019


TL;DR

This paper investigates the validity of F-tests in high-dimensional linear models where the number of predictors exceeds the number of observations, showing asymptotic correctness even under model misspecification.

Contribution

It provides theoretical results demonstrating the asymptotic validity of F-tests for simple linear models in high-dimensional settings, despite potential misspecification.

Findings

01. F-test remains valid asymptotically in high-dimensional regimes

02. Validity holds even when the simple model is misspecified

03. Results applicable to models with many more predictors than observations

Abstract

We study linear subset regression in the context of the high-dimensional overall model $y = ϑ + θ^{'} z + ϵ$ with univariate response $y$ and a $d$ -vector of random regressors $z$ , independent of $ϵ$ . Here, "high-dimensional" means that the number $d$ of available explanatory variables is much larger than the number $n$ of observations. We consider simple linear sub-models where $y$ is regressed on a set of $p$ regressors given by $x = M^{'} z$ , for some $d \times p$ matrix $M$ of full rank $p < n$ . The corresponding simple model, i.e., $y = α + β^{'} x + e$ , can be justified by imposing appropriate restrictions on the unknown parameter $θ$ in the overall model; otherwise, this simple model can be grossly misspecified. In this paper, we establish asymptotic validity of the standard $F$ -test on the surrogate parameter $β$ , in an appropriate sense, even when…

Tables

Table 1: Average absolute differences $\bar{D}=\frac{1}{100}\sum_{r=1}^{100}|\bar{p}_{r}-\alpha|$ of simulated rejection probabilities $\bar{p}_{r}=\frac{1}{1000}\sum_{j=1}^{1000}\mathbf{1}\{F_{j,r}>F^{-1}_{p,n-p-1,0}(1-\alpha)\}$ and nominal significance level $\alpha=0.05$ of the $F$-test for $H_{0}:\beta=0$.

$d \backslash p$	1	2	5	25	1	2	5	25
		$t(5)$				Exp(1)
2	0.077				0.141
4	0.056	0.076			0.093	0.140
10	0.032	0.047	0.066		0.052	0.071	0.109
50	0.009	0.013	0.017	0.019	0.014	0.015	0.020	0.033
100	0.007	0.008	0.009	0.010	0.009	0.009	0.012	0.015
200	0.006	0.007	0.006	0.008	0.007	0.007	0.006	0.009
		$t(3)$				Unif$[-1,1]$
2	0.188				0.025
4	0.158	0.225			0.020	0.023
10	0.122	0.167	0.238		0.011	0.014	0.016
50	0.062	0.084	0.116	0.123	0.006	0.006	0.007	0.007
100	0.048	0.061	0.081	0.082	0.005	0.006	0.006	0.005
200	0.033	0.044	0.057	0.055	0.005	0.005	0.005	0.006
		$t(2)$				Gauss
2	0.335				0.005
4	0.332	0.458			0.006	0.005
10	0.301	0.411	0.563		0.005	0.005	0.006
50	0.250	0.335	0.456	0.518	0.005	0.006	0.005	0.005
100	0.228	0.314	0.412	0.457	0.005	0.005	0.006	0.005
200	0.212	0.286	0.383	0.407	0.005	0.005	0.006	0.006

Equations

y = \vartheta + \theta^{\prime} z + \epsilon

z = \mu + \Sigma^{1/2} R \tilde{z}

x = M^{\prime} z

y = \alpha + \beta^{\prime} x + e

\beta = (M^{\prime}\Sigma M)^{-1} M^{\prime}\Sigma\theta \quad\text{and}\quad s^{2} = \theta^{\prime}\Sigma\theta - \theta^{\prime}\Sigma M (M^{\prime}\Sigma M)^{-1} M^{\prime}\Sigma\theta + \sigma^{2}.

H_{0}: \beta = 0 \quad\text{versus}\quad H_{1}: \beta \neq 0.

\tilde{H}_{0}: \mathbb{E}[y\|x] \text{ is constant.}

\Delta = \operatorname{Var}(\beta^{\prime}x)/\operatorname{Var}(e) = \beta^{\prime}M^{\prime}\Sigma M\beta/s^{2}

\sup_{M}\;\sup_{\Sigma}\;\sup_{f_{\tilde{z}}\in\mathcal{F}_{d,20}(D,E)}\;\nu_{d}(\mathbb{U})\quad\xrightarrow[\frac{p}{\log d}\to 0]{}\quad 1

\sup_{t\in\mathbb{R}}\left|\mathbb{P}\big(\hat{F}_{n}\leq t\big)-F_{p,n-p-1,n\Delta}(t)\right|

\mathbb{P}\big(\hat{F}_{n}>F^{-1}_{p,n-p-1,0}(\alpha)\big)-\Phi\Big(-\Phi^{-1}(\alpha)+\sqrt{n}\Delta\sqrt{\frac{1-p/n}{2p/n}}\Big)

\sup_{M}\;\sup_{\substack{\vartheta,\theta,\mathcal{L}(\epsilon),\mu,\Sigma\\ \mathbb{E}|\epsilon/\sigma|^{8+\lambda}\leq L\\ \Delta<\gamma/\sqrt{n}}}\;\sup_{f_{\tilde{z}}\in\mathcal{F}_{d,20}(D,E)}\;\sup_{R\in\mathbb{U}}\;\Xi_{n}\quad\xrightarrow[\frac{n^{2}}{\log d}\to 0,\ \frac{p}{n}\to\rho]{n\to\infty}\quad 0.

\sup_{B\in\mathbb{G}}\mathbb{P}\left(\big\|\mathbb{E}[\tilde{z}\|B^{\prime}\tilde{z}]-BB^{\prime}\tilde{z}\big\|>t\right)

\sup_{B\in\mathbb{G}}\mathbb{P}\left(\big\|\mathbb{E}[\tilde{z}\tilde{z}^{\prime}\|B^{\prime}\tilde{z}]-(I_{d}-BB^{\prime}+BB^{\prime}\tilde{z}\tilde{z}^{\prime}BB^{\prime})\big\|>t\right)

\frac{1}{t}d^{-1/20}+4\gamma\frac{p}{\log d},

\nu_{d,p}(\mathbb{G}^{c})\leq\kappa\,d^{-(1-20\gamma\frac{p}{\log d})/20},

\mathbb{U}:=\mathbb{U}(M,\Sigma,f_{\tilde{z}}):=\{R\in\mathcal{O}_{d}:R^{\prime}\Sigma^{1/2}M(M^{\prime}\Sigma M)^{-1/2}\in\mathbb{G}(f_{\tilde{z}})\}.

\nu_{d}(\mathbb{U})

=P(U\Sigma^{1/2}M(M^{\prime}\Sigma M)^{-1/2}V\in\mathbb{G})=\nu_{d,p}(\mathbb{G}),

\begin{split}\mathbb{E}[e\|x]&=\tilde{\theta}^{\prime}(I_{d}-P_{\tilde{M}})\Big\{\mathbb{E}[\tilde{z}\|P_{\tilde{M}}\tilde{z}]-P_{\tilde{M}}\tilde{z}\Big\}\quad\text{and}\\ \mathbb{E}[e^{2}\|x]-s^{2}&=\tilde{\theta}^{\prime}(I_{d}-P_{\tilde{M}})\Big\{\mathbb{E}[\tilde{z}\tilde{z}^{\prime}\|P_{\tilde{M}}\tilde{z}]-((I_{d}-P_{\tilde{M}})+P_{\tilde{M}}\tilde{z}\tilde{z}^{\prime}P_{\tilde{M}})\Big\}(I_{d}-P_{\tilde{M}})\tilde{\theta};\end{split}

P(|\mathbb{E}[e\|x]|>t)

P(|\mathbb{E}[e^{2}\|x]-s^{2}|>t)

P\left(\Big\|\mathbb{E}[\tilde{z}\tilde{z}^{\prime}\|P_{\tilde{M}}\tilde{z}]-((I_{d}-P_{\tilde{M}})+P_{\tilde{M}}\tilde{z}\tilde{z}^{\prime}P_{\tilde{M}})\Big\|>t/\|(I_{d}-P_{\tilde{M}})\tilde{\theta}\|^{2}\right).

n^{k}\|E-E^{*}\|/s

n^{k}|E^{\prime}P_{n}E-E^{*\prime}P_{n}E^{*}|/s^{2}

\max_{i=1,\dots,n}\left|\operatorname{Var}[e_{i}\|x_{i}]/s^{2}-1\right|\xrightarrow{\;p\;}0.

|e_{1}-e_{1}^{*}|/s

\leq\frac{s}{\operatorname{Var}[e_{1}\|x_{1}]}\left(\frac{|e_{1}|}{s}\frac{|\operatorname{Var}[e_{1}\|x_{1}]-s^{2}|}{s^{2}}+\frac{|\mathbb{E}[e_{1}\|x_{1}]|}{s}\right),

P(n^{2k+1}|e_{1}-e_{1}^{*}|^{2}/s^{2}>t^{2})

\leq P\left(n^{k+1/2}\frac{|e_{1}|}{s}\frac{|\operatorname{Var}[e_{1}\|x_{1}]-s^{2}|}{s^{2}}+\frac{|\mathbb{E}[e_{1}\|x_{1}]|}{s}>t/2\right)

+P\left(\frac{s^{2}}{\operatorname{Var}[e_{1}\|x_{1}]}>2\right)

\leq P\left(\frac{\operatorname{Var}[e_{1}\|x_{1}]}{s^{2}}-1>\frac{1}{2}\right)+P\left(n^{k+1/2}\frac{|e_{1}|}{s}\frac{|\operatorname{Var}[e_{1}\|x_{1}]-s^{2}|}{s^{2}}>t/2^{3/2}\right)

+P\left(n^{k+1/2}\frac{|\mathbb{E}[e_{1}\|x_{1}]|}{s}>t/2^{3/2}\right)

\leq P\left(\frac{|\operatorname{Var}[e_{1}\|x_{1}]-s^{2}|}{s^{2}}>\frac{1}{2}\right)+P\left(n^{k+3/2}\frac{|\operatorname{Var}[e_{1}\|x_{1}]-s^{2}|}{s^{2}}>t/2^{3/2}\right)+P\left(\frac{|e_{1}|}{s}>n\right)+P\left(n^{k+1/2}\frac{|\mathbb{E}[e_{1}\|x_{1}]|}{s}>t/2^{3/2}\right).

P\left(n^{k+1/2}\frac{|\mathbb{E}[e_{1}\|x_{1}]|}{s}>t\right)\leq t^{-1}n^{k+1/2}d_{n}^{-1/20}+4\gamma\frac{p_{n}}{\log d_{n}},


Taxonomy

Topics: Statistical Methods and Inference · Random Matrices and Applications · Statistical Methods and Bayesian Inference

Full text

Statistical inference with $F$-statistics when fitting simple models to high-dimensional data

Hannes Leeb (University of Vienna and DataScience@UniVienna)

Lukas Steinberger (University of Freiburg)

Abstract

We study linear subset regression in the context of the high-dimensional overall model $y=\vartheta+\theta^{\prime}z+\epsilon$ with univariate response $y$ and a $d$ -vector of random regressors $z$ , independent of $\epsilon$ . Here, ‘high-dimensional’ means that the number $d$ of available explanatory variables is much larger than the number $n$ of observations. We consider simple linear sub-models where $y$ is regressed on a set of $p$ regressors given by $x=M^{\prime}z$ , for some $d\times p$ matrix $M$ of full rank $p<n$ . The corresponding simple model, i.e., $y=\alpha+\beta^{\prime}x+e$ , can be justified by imposing appropriate restrictions on the unknown parameter $\theta$ in the overall model; otherwise, this simple model can be grossly misspecified. In this paper, we establish asymptotic validity of the standard $F$ -test on the surrogate parameter $\beta$ , in an appropriate sense, even when the simple model is misspecified.

1 Introduction

The $F$-test is a staple tool of applied statistical analyses. It is widely used, sometimes also in situations where its applicability is debatable because underlying assumptions may not be met. We study a situation of this kind: An $F$-test after fitting a (possibly misspecified) working model. We focus, in particular, on a scenario where the fitted model has $p$ explanatory variables while the true model has $d$ explanatory variables, with $p\ll d$, and where sample size is of the same order as $p$, i.e., $p=O(n)$. Scenarios like this occur, for example, in quality control studies like Souders and Stenbakken (1991), where a model with 18 explanatory variables (out of a total of about 8,000) is fit based on a sample of size 50; in time series forecasting with principal components as in Stock and Watson (2002), who extract a handful of factors from 149 explanatory variables based on 480 monthly observations; or in genetic analyses like van ’t Veer et al. (2002), who select and fit a model with 70 genes (out of a total of about 25,000) based on a sample of size 78. In situations like these, the question whether the fitted model has any explanatory value is of particular interest. We show that, approximately, the usual $F$-statistic is $F$-distributed under a corresponding null-hypothesis, and that it is non-central $F$-distributed in a local neighborhood of the null. Approximation errors go to zero as $n\to\infty$ if $n^{2}/\log d\to 0$ and if, at the same time, $p$ is of the same, or of slower, order as $n$; cf. Theorem 4.1 and Remark 4.3, respectively. Our results are uniform over a large region of the parameter space that we consider. In particular, our results also cover situations where the fitted model is misspecified. The setting of our analysis is non-standard in that we require a particular constellation of $d$, $p$ and $n$. This is a challenging setting of practical relevance, for which few theoretical results are available so far. Our findings, which are given for independent observations, also prompt the question whether similar results can be obtained under serial correlation.

The $F$ -statistic is exactly $F$ -distributed in a correctly specified linear model with Gaussian errors; and it is asymptotically $F$ -distributed under the strong Gauß-Markov condition on the errors if $n\to\infty$ while the model dimension stays fixed; cf. Anderson (1958). $F$ -tests in correctly specified models in settings where $p$ is allowed to increase with $n$ are studied, among others, by Akritas and Arnold (2000); Bathke and Lankowski (2005); Boos and Brownie (1995); Harrar and Bathke (2008); Portnoy (1984, 1985); Wang and Cui (2013). In addition, there are several viable alternatives to the $F$ -test in potentially misspecified settings; see, for example, Chen and Qin (2010); Eicker (1967); Huber (1967); White (1980a, b); Zhong and Chen (2011). For further results on hypothesis testing and marginal screening in misspecified models, see, for example, Boos and Stefanski (2013); Choi and Kiefer (2011); Fomby and Hill (2003); Jensen and Ramirez (1991); Ramirez and Jensen (1991), and the references therein.

On a technical level, this paper relies on Wang and Cui (2013), the corresponding extensions and corrections in Steinberger (2016), and also on Steinberger and Leeb (2018a, b); all but the first of these references are based on Steinberger (2015).

The rest of the paper is structured as follows: In Section 2, we describe the true data-generating model and the underlying parameter space. The (typically misspecified) working model and the corresponding $F$ -statistic are described in Section 3. Our main theoretical result is given in Section 4, and a simulation study in Section 5 demonstrates that our asymptotic approximations can ‘kick-in’ reasonably fast.

2 The true model

Throughout, we consider the (true) linear model

$$y=\vartheta+\theta^{\prime}z+\epsilon \tag{1}$$

with $\vartheta\in{\mathbb{R}}$ and $\theta\in{\mathbb{R}}^{d}$ for some $d\in{\mathbb{N}}$ . We assume that the error $\epsilon$ is independent of $z$ , with mean zero and finite variance $\sigma^{2}>0$ ; its distribution will be denoted by $\mathcal{L}(\epsilon)$ . Moreover, we assume that the vector of regressors $z$ has mean $\mu\in{\mathbb{R}}^{d}$ and positive definite variance/covariance matrix $\Sigma$ . Our model assumptions are further discussed in Steinberger and Leeb (2018a, Remark 7.1). No additional restrictions will be placed on the regression coefficients $\vartheta$ and $\theta$ , on the moments $\mu$ and $\Sigma$ , or on the error distribution $\mathcal{L}(\epsilon)$ .

We do place some assumptions on the distribution of the explanatory variables. First, we assume that $z$ can be written as an affine transformation of independent random variables. With this, we can represent the $d$ -vector $z$ as

$$z=\mu+\Sigma^{1/2}R\tilde{z} \tag{2}$$

for a $d$ -vector $\tilde{z}$ with independent (but not necessarily identically distributed) components so that ${\mathbb{E}}[\tilde{z}]=0$ and ${\mathbb{E}}[\tilde{z}\tilde{z}^{\prime}]=I_{d}$ , where $\Sigma^{1/2}$ is the positive definite and symmetric square root of $\Sigma$ , and where $R$ is an orthogonal (non-random) matrix. Second, we assume that $\tilde{z}$ has a Lebesgue density, which we denote by $f_{\tilde{z}}$ , with bounded marginal densities and finite marginal moments of sufficiently high order. In particular, we will assume that $f_{\tilde{z}}$ belongs to one of the classes ${\mathcal{F}}_{d,k}(D,E)$ that are defined in the next paragraph, for appropriate constants $k$ , $D$ and $E$ . Our assumptions on $z$ are similar to those maintained by Bai and Saranadasa (1996) and Zhong and Chen (2011). For later use, note that the distribution of $(y,z)$ in (1)–(2) is characterized by $\vartheta$ and $\theta$ , by $\mathcal{L}(\epsilon)$ , by $\Sigma$ and $\mu$ , by $f_{\tilde{z}}$ , and by $R$ .

Fix an integer $k\geq 1$ and positive (finite) constants $D$ and $E$ . With this, write ${\mathcal{F}}_{d,k}(D,E)$ for the class of Lebesgue densities on ${\mathbb{R}}^{d}$ that are products of univariate marginal densities such that each such marginal density is bounded from above by $D$ , and such that each univariate marginal density has absolute moments of order up to $k$ that are bounded by $E$ .
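As a hedged illustration of the representation $z=\mu+\Sigma^{1/2}R\tilde{z}$, the following sketch draws regressors of this form. The helper names (`sym_sqrt`, `haar_orthogonal`, `draw_regressors`), the toy dimension, and the choice of centered Exp(1) components for $\tilde{z}$ are assumptions made for the example only; centered exponential components have bounded density and finite moments of all orders, so they fit a class ${\mathcal{F}}_{d,k}(D,E)$ for suitable constants.

```python
import numpy as np

def sym_sqrt(sigma):
    """Symmetric positive definite square root of a covariance matrix."""
    eigval, eigvec = np.linalg.eigh(sigma)
    return (eigvec * np.sqrt(eigval)) @ eigvec.T

def haar_orthogonal(d, rng):
    """Draw an orthogonal d x d matrix from the uniform (Haar) measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fixing the signs of the columns makes the distribution exactly Haar.
    return q * np.sign(np.diag(r))

def draw_regressors(n, mu, sigma, R, rng):
    """Return n draws of z = mu + Sigma^{1/2} R z_tilde, one draw per row."""
    d = len(mu)
    # Assumption for illustration: independent centered Exp(1) components
    # (mean zero, variance one).
    z_tilde = rng.exponential(1.0, size=(n, d)) - 1.0
    return mu + z_tilde @ (sym_sqrt(sigma) @ R).T

rng = np.random.default_rng(0)
d = 4
sigma = np.diag([4.0, 1.0, 1.0, 1.0])   # toy covariance matrix
R = haar_orthogonal(d, rng)
Z = draw_regressors(5, np.zeros(d), sigma, R, rng)
```

Because $R$ is orthogonal and $\tilde{z}$ has identity covariance, the rows of `Z` have covariance matrix `sigma` regardless of which `R` is drawn; only the higher-order dependence structure changes with `R`.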

3 The sub-model and the $F$ -test

Consider a sub-model where $y$ is regressed on $x$ , with $x$ given by

$$x=M^{\prime}z \tag{3}$$

for some full-rank $d\times p$ matrix $M$ with $p<d$ . For example, $M$ can be a selection matrix that picks out $p$ components of the $d$ -vector $z$ . Submodels with regressors of the form $x=M^{\prime}z$ also occur in principal component regression, partial least squares, and certain sufficient dimension reduction methods. We are particularly interested in situations where $d$ is much larger than $p$ , i.e., $p\ll d$ . Trivially, we can write

$$y=\alpha+\beta^{\prime}x+e \tag{4}$$

with $e=y-\alpha-\beta^{\prime}x$ , where $\alpha$ and $\beta$ minimize ${\mathbb{E}}[(y-\alpha-\beta^{\prime}x)^{2}]$ . The ‘error’ $e$ has mean zero (because both (1) and (4) include an intercept), and we denote its variance by $s^{2}={\mathbb{E}}[e^{2}]$ . Note that $\alpha=\vartheta+\mu^{\prime}\theta-\mu^{\prime}M(M^{\prime}\Sigma M)^{-1}M^{\prime}\Sigma\theta$ and, for later use, that

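$$\beta=(M^{\prime}\Sigma M)^{-1}M^{\prime}\Sigma\theta\quad\text{and}\quad s^{2}=\theta^{\prime}\Sigma\theta-\theta^{\prime}\Sigma M(M^{\prime}\Sigma M)^{-1}M^{\prime}\Sigma\theta+\sigma^{2}. \tag{5}$$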

Irrespective of whether the working model is correctly specified, the ‘surrogate’ parameters $\alpha$ , $\beta$ and $s^{2}$ are always well-defined. Here, $\beta$ is our main object of interest, instead of the underlying true parameter $\theta$ . Such surrogate parameters are well-known in the statistics literature, certainly since Huber (1967), and have recently gained new popularity, as witnessed by, e.g., Abadie et al. (2014); Brannath and Scharpenberg (2014); Bachoc et al. (2015); Buja et al. (2014). In particular, such surrogate parameters can be consistently estimated, in a standard $M$ -estimation setting, by the OLS estimator or by robust alternatives, provided that $p$ is not too large relative to $n$ (see Portnoy, 1984, 1985; White, 1980a, b); cf. also Lemma A.3 in Steinberger (2015) and Lemma A.4 in Steinberger and Leeb (2018a) for analyses tailored to our present setting.
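The surrogate parameters can be computed directly from the population quantities. The sketch below is a minimal illustration, assuming the usual population least-squares closed forms $\beta=(M^{\prime}\Sigma M)^{-1}M^{\prime}\Sigma\theta$ and $s^{2}=\theta^{\prime}\Sigma\theta-\theta^{\prime}\Sigma M\beta+\sigma^{2}$ together with the expression for $\alpha$ given above; the function name `surrogate_parameters` and the toy inputs are hypothetical.

```python
import numpy as np

def surrogate_parameters(vartheta, theta, mu, sigma_mat, M, sigma2):
    """Return (alpha, beta, s2) for the best linear predictor of y given x = M'z."""
    MSM = M.T @ sigma_mat @ M            # Var(x) = M' Sigma M
    MStheta = M.T @ sigma_mat @ theta    # Cov(x, y) = M' Sigma theta
    beta = np.linalg.solve(MSM, MStheta)
    alpha = vartheta + mu @ theta - mu @ M @ beta
    # Var(e) = Var(theta'z) - Var(beta'x) + sigma^2
    s2 = theta @ sigma_mat @ theta - MStheta @ beta + sigma2
    return alpha, beta, s2

# Toy check (assumed values): theta lies in the column space of M,
# so the working model is correctly specified.
rng = np.random.default_rng(0)
d, p = 5, 2
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)              # a positive definite covariance
M = np.eye(d)[:, :p]                     # selects the first p regressors
b = np.array([1.0, -2.0])
theta = M @ b
mu = rng.standard_normal(d)
vartheta, sigma2 = 0.5, 2.0
alpha, beta, s2 = surrogate_parameters(vartheta, theta, mu, Sigma, M, sigma2)
```

In this correctly specified toy case the surrogate parameters reduce to the true ones: `beta` equals `b`, `alpha` equals `vartheta`, and `s2` equals `sigma2`. For a general $\theta$ outside the column space of $M$, `s2` exceeds `sigma2` by the variance of the part of $\theta^{\prime}z$ that $x$ cannot explain linearly.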

The working model (4) is correct (in the usual sense) if ${\mathbb{E}}[y\|z]={\mathbb{E}}[y\|x]$ , i.e., if $\vartheta+\theta^{\prime}z=\alpha+\beta^{\prime}x$ or, equivalently, if $\epsilon=e$ . This is the case if $\theta$ lies in the column space of $M$ ; if $M$ is a selection matrix, this means that $M^{\prime}\theta$ selects all the non-zero components of $\theta$ . Here, we do not assume that the working model is correct. In particular, we stress that $e$ may differ from $\epsilon$ , and that $e$ may depend on $x$ .

When working with the simple sub-model (4), a natural question is whether $x$ has any explanatory value for the response variable $y$ . Given a sample of $n>p+1$ independent and identically distributed (i.i.d.) observations of $y$ and $x$ from (4), a classical approach to this question is to use the $F$ -test of the hypotheses

$$H_{0}:\beta=0\quad\text{versus}\quad H_{1}:\beta\neq 0.$$

Let $Y$ and $X$ denote the $n\times 1$ vector of responses and the $n\times p$ matrix of explanatory variables, respectively. Write $\hat{\beta}$ for the OLS-estimator for $\beta$ when $Y$ is regressed on $X$ and a constant, set $\hat{s}^{2}=\|(I_{n}-P_{\iota,X})Y\|^{2}/(n-p-1)$, and write $\hat{F}_{n}=\hat{F}_{n}(X,Y)$ for the usual $F$-statistic for testing $H_{0}$, i.e., $\hat{F}_{n}=\|(I_{n}-P_{\iota})X\hat{\beta}\|^{2}/(p\hat{s}^{2})$ if the numerator is well-defined and the denominator is positive, and $\hat{F}_{n}=0$ otherwise. Here, $P_{\dots}$ denotes the orthogonal projection on the space spanned by the column-vectors indicated in the subscript and $\iota$ denotes the $n$-vector $\iota=(1,\dots,1)^{\prime}$. Note that $\hat{F}_{n}>0$ with probability one by our assumptions.
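The definition of $\hat{F}_{n}$ translates into a few lines of linear algebra. The following is a minimal sketch using NumPy least squares; the function name `f_statistic` and the simulated toy data are assumptions for illustration.

```python
import numpy as np

def f_statistic(X, Y):
    """Usual F-statistic for H0: beta = 0 when Y is regressed on X and a constant."""
    n, p = X.shape
    design = np.hstack([np.ones((n, 1)), X])          # columns: iota, X
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    beta_hat = coef[1:]
    resid = Y - design @ coef                         # (I_n - P_{iota,X}) Y
    s2_hat = resid @ resid / (n - p - 1)
    centered_fit = (X - X.mean(axis=0)) @ beta_hat    # (I_n - P_iota) X beta_hat
    numerator = centered_fit @ centered_fit
    if not np.isfinite(numerator) or s2_hat <= 0:
        return 0.0                                    # convention from the text
    return float(numerator / (p * s2_hat))

# Toy data (assumed): pure-noise response, so H0 holds.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
F = f_statistic(X, Y)
```

The statistic is invariant to rescaling and shifting of `Y` and to nonsingular linear transformations of the columns of `X`, which is the scale-invariance used in the simulation design.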

$H_{0}$ may be re-phrased as the hypothesis that the best linear predictor of $y$ given $x$ is constant. An alternative to $H_{0}$ is the hypothesis that the Bayes-estimator of $y$ given $x$ is constant, i.e.,

$$\tilde{H}_{0}:{\mathbb{E}}[y\|x]\text{ is constant.}$$

Testing this non-parametric hypothesis is more difficult. In the asymptotic setting that we consider in the next section, however, we find that $H_{0}$ and $\tilde{H}_{0}$ are close to each other in the sense that the Bayes predictor and the best linear predictor (of $y$ given $x$ ) are close in terms of mean-squared prediction error; see Remark 4.2 for details.

4 Main result

Our main result is concerned with the asymptotic distribution of the $F$ -statistic in a local neighborhood of the null-hypothesis. Here, the local neighborhood is defined through the requirement that

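$$\Delta=\operatorname{Var}(\beta^{\prime}x)/\operatorname{Var}(e)=\beta^{\prime}M^{\prime}\Sigma M\beta/s^{2}$$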

is small. This quantity can be interpreted as a signal-to-noise ratio in (4) and depends on $\theta$, $M$, $\Sigma$ and $\sigma^{2}={\mathbb{E}}[\epsilon^{2}]$; cf. (5). If the error $e$ in (4) is Gaussian and independent of $x$, then the $F$-statistic $\hat{F}_{n}$ is $F$-distributed with parameters $p$, $n-p-1$ and non-centrality parameter $n\Delta$; in that case, we have ${\mathbb{P}}(\hat{F}_{n}\leq t)=F_{p,n-p-1,n\Delta}(t)$, where $F_{p,n-p-1,n\Delta}(\cdot)$ denotes the cumulative distribution function (c.d.f.) of the $F$-distribution with indicated parameters. In our present setting, however, the error $e$ in (4) need not be Gaussian and can depend on $x$.

We will show that the distribution of $\hat{F}_{n}$ can be approximated by an $F$ -distribution, uniformly over most parameters in the model. Only for $\epsilon$ , $f_{\tilde{z}}$ and $R$ , i.e., for the error in (1) and for the density of the standardized explanatory variables as well as the orthogonal matrix in (2), some restrictions are needed. We will require a moment restriction on $\epsilon/\sigma$ , and we will require that $f_{\tilde{z}}$ belongs to one of the classes ${\mathcal{F}}_{d,k}(D,E)$ introduced earlier. To formulate the restriction on $R$ , write ${\mathcal{O}}_{d}$ for the collection of all orthogonal $d\times d$ matrices and write $\nu_{d}$ for the uniform distribution on that set; i.e., $\nu_{d}$ is the normalized Haar measure on the $d$ -dimensional orthogonal group. For $R$ , we will require that it belongs to a Borel set ${\mathbb{U}}\subseteq{\mathcal{O}}_{d}$ that is large in terms of $\nu_{d}$ .

Theorem 4.1.

Fix finite constants $D\geq 1$ and $E\geq 1$ , and positive finite constants $\rho\in(0,1)$ , $\lambda$ , $L$ and $\gamma$ . For each full-rank $d\times p$ matrix $M$ , each $d\times d$ variance/covariance matrix $\Sigma>0$ and each $f_{\tilde{z}}\in{\mathcal{F}}_{d,20}(D,E)$ there exists a Borel set ${\mathbb{U}}={\mathbb{U}}(M,\Sigma,f_{\tilde{z}})\subseteq{\mathcal{O}}_{d}$ so that

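$$\sup_{M}\;\sup_{\Sigma}\;\sup_{f_{\tilde{z}}\in{\mathcal{F}}_{d,20}(D,E)}\;\nu_{d}({\mathbb{U}})\quad\xrightarrow[\frac{p}{\log d}\to 0]{}\quad 1$$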

and so that the following holds: If $\Xi_{n}$ denotes either the quantity

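$$\sup_{t\in{\mathbb{R}}}\left|{\mathbb{P}}\big(\hat{F}_{n}\leq t\big)-F_{p,n-p-1,n\Delta}(t)\right| \tag{7}$$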

or the quantity

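$${\mathbb{P}}\big(\hat{F}_{n}>F^{-1}_{p,n-p-1,0}(\alpha)\big)-\Phi\Big(-\Phi^{-1}(\alpha)+\sqrt{n}\Delta\sqrt{\frac{1-p/n}{2p/n}}\Big) \tag{8}$$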

for some fixed $\alpha\in[0,1]$ , then

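$$\sup_{M}\;\sup_{\substack{\vartheta,\theta,{\mathcal{L}}(\epsilon),\mu,\Sigma\\ {\mathbb{E}}|\epsilon/\sigma|^{8+\lambda}\leq L\\ \Delta<\gamma/\sqrt{n}}}\;\sup_{f_{\tilde{z}}\in{\mathcal{F}}_{d,20}(D,E)}\;\sup_{R\in{\mathbb{U}}}\;\Xi_{n}\quad\xrightarrow[\frac{n^{2}}{\log d}\to 0,\ \frac{p}{n}\to\rho]{n\to\infty}\quad 0.$$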

This statement continues to hold if the restriction $\Delta<\gamma/\sqrt{n}$ in the last display is replaced by $\Delta<g(n)$ provided that $\lim_{n\to\infty}g(n)=0$ . [Here, the suprema are taken over all full-rank $d\times p$ matrices $M$ , all $\vartheta\in{\mathbb{R}}$ , all $d$ -vectors $\theta$ and $\mu$ , all distributions ${\mathcal{L}}(\epsilon)$ so that $\epsilon$ has mean zero and finite positive variance, and all symmetric and positive definite $d\times d$ matrices $\Sigma$ , subject to the indicated restrictions.]

Remark 4.2.

Write ${\mathcal{R}}_{N}$ and ${\mathcal{R}}_{L}$ for the prediction risk of the Bayes predictor and of the best linear predictor, respectively, of $y$ given $x$ . That is, ${\mathcal{R}}_{N}={\mathbb{E}}[(y-{\mathbb{E}}[y\|x])^{2}]$ and ${\mathcal{R}}_{L}={\mathbb{E}}[(y-(\alpha+\beta^{\prime}x))^{2}]$ . The results of Steinberger and Leeb (2018a) then entail that, in the setting of Theorem 4.1, ${\mathcal{R}}_{N}/{\mathcal{R}}_{L}$ converges to one, uniformly over all the parameters indicated in the last display of that theorem. In fact, the risk-ratio converges to one uniformly even if the restriction on $\Delta$ is removed altogether, and a similar statement holds for the ratio of conditional risks given $x$ , i.e., ${\mathbb{E}}[(y-{\mathbb{E}}[y\|x])^{2}\|x]/{\mathbb{E}}[(y-(\alpha+\beta^{\prime}x))^{2}\|x]$ . See Theorem 3.1 of Steinberger and Leeb (2018a) for a more general form of this statement under weaker assumptions.

Remark 4.3.

Although the asymptotic approximations in Theorem 4.1 require that $p$ is of the same order as $n$, we point out that the non-central $F$-distribution should still give a reasonable approximation to the distribution of the $F$-statistic, i.e., the expression in (7) should be small, even if $p/n$ is very small, and, in particular, if $p$ is fixed while $n$ increases. This situation is further discussed in Steinberger (2015, p. 31, Section 3.2.2) in a setting where $n\to\infty$, $p$ is fixed and $p/\log d\to 0$. Clearly, the same is not true for the expression in (8), because the normal approximation to the $F$-distribution is valid only if both degrees of freedom, i.e., $p$ and $n-p-1$, are large. The statement regarding (8) in Theorem 4.1 coincides with the conclusion of Theorem 1 in Zhong and Chen (2011) obtained for the correctly specified Gaussian error case. Moreover, the Gaussian approximation in (8) has the advantage that it is easier to interpret than the more complicated distribution function of the non-central $F$-distribution in (7); see also the discussion in Steinberger (2016, Remark 2.4).
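To make the Gaussian approximation in (8) concrete, the sketch below evaluates the term $\Phi\big(-\Phi^{-1}(\alpha)+\sqrt{n}\Delta\sqrt{(1-p/n)/(2p/n)}\big)$ with the Python standard library. The function name `normal_approx_rejection` and the example values of $n$, $p$ and $\Delta$ are assumptions; as in (8), the argument `level` is the c.d.f. level at which the central $F$ quantile is taken, so a 5% test corresponds to `level = 0.95`.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_rejection(level, n, p, delta):
    """Gaussian approximation to P(F_hat > F^{-1}_{p,n-p-1,0}(level)).

    `level` is the c.d.f. level of the critical value (0.95 for a 5% test),
    `delta` is the signal-to-noise ratio Delta, and 0 < p < n is required.
    """
    std_normal = NormalDist()
    shift = sqrt(n) * delta * sqrt((1 - p / n) / (2 * p / n))
    return std_normal.cdf(-std_normal.inv_cdf(level) + shift)

# Assumed example: n = 50, p = 25, and a local alternative Delta = 1 / sqrt(n).
size_under_null = normal_approx_rejection(0.95, 50, 25, 0.0)
power_local = normal_approx_rejection(0.95, 50, 25, 1 / sqrt(50))
```

At $\Delta=0$ the approximation returns exactly one minus `level`, and it increases with $\Delta$, which is the sense in which the formula is easier to read than the non-central $F$ c.d.f.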

5 Simulation analysis

Theorem 4.1 is an asymptotic result. In this section, we study a range of non-asymptotic scenarios through simulation to investigate how soon these asymptotic approximations become accurate. We consider a rather small sample size of $n=50$ and look at different configurations of the model dimensions $d$ and $p$ with $p<d$ , and also at different points in parameter space.

The theorem contains two asymptotic statements, one about the distribution of the $F$ -statistic and one about the size of the set $\mathbb{U}$ . For the distribution of the $F$ -statistic, we compare the rejection probability of the $F$ -test under the null hypothesis with the nominal significance level $\alpha=0.05$ . The nominal significance level provides a natural benchmark. [Clearly, one can also investigate the power of the $F$ -test through simulation experiments, but, unlike the significance level, it is less obvious what the right benchmark for the power should be.] In particular, we simulate 1000 independent realizations $F_{j,r}$ , $j=1,\dots,1000$ of the $F$ -statistic at sample size $n=50$ under the null for each point in parameter space (the index $r$ will be explained shortly), and compare the empirical significance level $\overline{p}_{r}=1000^{-1}\sum_{j=1}^{1000}{\mathbf{1}}\{F_{j,r}>F^{-1}_{p,n-p-1,0}(1-\alpha)\}$ with the nominal level $\alpha$ .

Gauging the size of $\mathbb{U}$ is more difficult, because that set is not given explicitly. We proceed as follows: We fix all the parameters in (1)–(2) except for the orthogonal matrix $R$ in (2). We then simulate 100 independent realizations $R_{r}$ of $R$ , compute $\overline{p}_{r}$ as outlined above, $r=1,\dots,100$ , and finally compute $\overline{D}=100^{-1}\sum_{r=1}^{100}|\overline{p}_{r}-\alpha|$ . If $R_{r}\in\mathbb{U}$ , then $\overline{p}_{r}$ should be close to $\alpha$ , in view of the last display in Theorem 4.1. We use $\overline{D}$ and the empirical distribution of the $\overline{p}_{r}$ , $r=1,\dots,100$ , as indicators for the size of $\mathbb{U}$ .
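The two summary quantities $\overline{p}_{r}$ and $\overline{D}$ are simple averages. The sketch below shows them as plain Python functions; the names `empirical_level` and `mean_abs_deviation` and the toy inputs are hypothetical, and the critical value is taken as given rather than computed from an $F$ quantile.

```python
def empirical_level(f_values, critical_value):
    """Fraction of simulated F-statistics exceeding the critical value (p_bar_r)."""
    return sum(f > critical_value for f in f_values) / len(f_values)

def mean_abs_deviation(p_bars, alpha):
    """Average absolute difference between the p_bar_r and the nominal level (D_bar)."""
    return sum(abs(p - alpha) for p in p_bars) / len(p_bars)

# Toy numbers (assumed), far smaller than the 1000 x 100 design in the text.
toy_f_values = [0.5, 2.0, 3.5, 1.0]
toy_level = empirical_level(toy_f_values, critical_value=1.5)
toy_d_bar = mean_abs_deviation([0.04, 0.07], alpha=0.05)
```

In the setup described above, `f_values` would hold the 1000 statistics $F_{j,r}$ for one draw $R_{r}$, and `p_bars` the 100 resulting empirical levels.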

The remaining parameters in (1)–(2) and the submodel matrix $M$ are chosen as follows for any fixed values of $d$ and $p$: We do not include an error term in the true model, i.e., we set $\sigma^{2}=0$, because the effect of misspecification becomes more pronounced when the error variance $\sigma^{2}$ is small (see footnote 1). [Note that the case where $\sigma^{2}=0$ is not covered by Theorem 4.1, but inspection of the proof shows that our results also apply in this case; cf. Remark A.3.] For $\tilde{z}$, we consider product distributions with zero mean and i.i.d. components from the Student $t$-distribution with $2$, $3$ and $5$ degrees of freedom, as well as from the centered exponential, uniform, Bernoulli $\{-1,1\}$ and Gaussian distributions. [Note that the scaling of these distributions is inconsequential, because of the scale-invariance of the $F$-statistic $\hat{F}(X,Y)$ in both arguments and the fact that we do not include an error term in the full model, i.e., scaling of $\tilde{z}_{i}$ is equivalent to scaling of both $y_{i}=\theta^{\prime}z_{i}$ and $x_{i}=M^{\prime}z_{i}$. Similarly, also the scaling of $\theta$ and $\Sigma$ has no impact on the value of the $F$-statistic.] For $\Sigma$, we chose a spiked covariance matrix $\Sigma=U\text{diag}(\lambda_{1},\dots,\lambda_{d})U^{\prime}$ with eigenvalues $\lambda_{1}=\lambda_{2}=400$ and $\lambda_{3}=\dots=\lambda_{d}=1$ and an orthogonal matrix of eigenvectors $U$ chosen randomly from the uniform distribution on the orthogonal group (see footnote 2).

Footnote 1: Note that if the error variance $\sigma^{2}=\operatorname{Var}[\epsilon_{i}]$ in the true model $y_{i}=\theta^{\prime}z_{i}+\epsilon_{i}$ is overly large, i.e., much larger than $\theta^{\prime}\Sigma\theta$, then the scaled true model is essentially given by $y_{i}/\sigma\approx\epsilon_{i}/\sigma$. Since the $F$-statistic is scale-invariant and $\epsilon$ is independent of $X$, we then have $\hat{F}(X,Y)=\hat{F}(X,Y/\sigma)\approx\hat{F}(X,(\epsilon_{i})_{i=1}^{n}/\sigma)=\hat{F}(X,(\epsilon_{i})_{i=1}^{n})$. In that case, the $F$-statistic will essentially follow the null-distribution and we expect a rejection probability close to the nominal level, irrespective of $\theta$ and $R$.

Footnote 2: The spiked covariance model corresponds to a factor model where the identity matrix is perturbed by a low rank matrix. It has received much attention in the literature on high dimensional random matrices (e.g., Baik and Silverstein, 2006; Cai et al., 2013; Donoho et al., 2013; Johnstone, 2001). We have repeated the simulations also with covariance matrices of an AR(1) process and obtained essentially the same results.

The intercept terms $\vartheta$ and $\mu$ are set to zero, for convenience. For the matrix $M$ , which describes the working model, we take $M$ equal to the $d\times p$ matrix whose $k$ -th column is the $k$ -th standard basis vector in $\mathbb{R}^{d}$ , $1\leq k\leq p$ . In other words, we consider a sub-model that includes only the first $p$ regressors (out of $d$ ). For the parameter $\theta\in\mathbb{R}^{d}$ , we need to ensure that the null hypothesis is satisfied, i.e., that $\beta=(M^{\prime}\Sigma M)^{-1}M^{\prime}\Sigma\theta=0$ . By construction of $\Sigma$ , $M^{\prime}\Sigma M$ is regular, and we choose $\theta=(I_{d}-P_{\Sigma M})V/\|(I_{d}-P_{\Sigma M})V\|$ , for one realization of $V\thicksim N(0,I_{d})$ , to guarantee that $M^{\prime}\Sigma\theta=0$ .
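The construction of $\Sigma$, $M$ and $\theta$ described above can be sketched as follows. This is an illustration under assumed values ($d=10$, $p=5$, a fixed random seed), not the authors' simulation code; the projection $(I_{d}-P_{\Sigma M})V$ is computed as a least-squares residual.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 10, 5   # assumed toy dimensions

# Spiked covariance: two eigenvalues equal to 400, the remaining d - 2 equal to 1,
# with eigenvectors drawn uniformly from the orthogonal group (QR with sign fix).
Q, Rq = np.linalg.qr(rng.standard_normal((d, d)))
U = Q * np.sign(np.diag(Rq))
eigenvalues = np.concatenate([[400.0, 400.0], np.ones(d - 2)])
Sigma = (U * eigenvalues) @ U.T

# Working model: the first p regressors out of d.
M = np.eye(d)[:, :p]

# theta = (I - P_{Sigma M}) V / ||(I - P_{Sigma M}) V|| with V ~ N(0, I_d).
SM = Sigma @ M
V = rng.standard_normal(d)
proj_coef, *_ = np.linalg.lstsq(SM, V, rcond=None)
residual = V - SM @ proj_coef            # component of V orthogonal to span(Sigma M)
theta = residual / np.linalg.norm(residual)

# Surrogate slope under this theta; it should vanish so that H0 holds.
beta = np.linalg.solve(M.T @ Sigma @ M, M.T @ Sigma @ theta)
```

Since `theta` is orthogonal to every column of $\Sigma M$, the product $M^{\prime}\Sigma\theta$ is zero up to rounding, so `beta` is numerically zero even though `theta` itself is a unit vector that the working model ignores.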

The results of the simulations are summarized in Table 1 and Figures 1 and 2. From Table 1, the overall picture is consistent with what our theory predicts. For all distributions except the Gaussian, the average absolute difference between the true (simulated) rejection probabilities and the nominal level decreases as $d$ increases. This phenomenon is most pronounced for the exponential distribution, which has a finite moment generating function around the origin, and is weakest for the $t(2)$-distribution, which does not even have finite variance. For uniformly distributed design, which is bounded, the effect of misspecification on the size of the $F$-test is relatively mild already for small dimensions. In the Gaussian case, all sub-models of the form (4) are correct in the sense that the error $e$ is Gaussian with mean zero and independent of $x$, so that theoretically the corresponding panel in Table 1 should contain only zeroes. The numbers therefore represent only the simulation error and serve as a benchmark for the other panels. We also see a monotonic increase in the deviation of the size of the $F$-test from the nominal level as the dimension $p$ of the sub-model increases, which was also suggested by our theory. However, if we fix the ratio $p/d=1/2$, i.e., if we move along the staircase pattern in each of the panels, we still see the effect of misspecification decrease as $d$ increases, except for the heavy-tailed distributions $t(3)$ and $t(2)$. This suggests that convergence of $n^{2}/\log(d)\sim p^{2}/\log(d)$ to zero, as required in Theorem 4.1, may not be necessary, at least in the scenarios considered here.

In Table 1, the effect of the orthogonal matrix $R$ on the actual significance level of the $F$-test was compressed into one summary statistic, namely the mean absolute deviation from the nominal significance level. To get a more comprehensive picture, Figures 1 and 2 show plots of the sample $(\bar{p}_{r})_{r=1}^{100}$ (gray crosses) and superimposed box-plots for different design distributions. Due to limited space we present only the results for sub-models of dimension $p=5$. In view of Theorem 4.1, we expect that the size of $\mathbb{U}$, i.e., of the family of matrices $R$ for which (7) and (8) are small, grows with $d$. Consequently, we expect that many of the $\bar{p}_{r}$ should be close to $\alpha=0.05$. On the other hand, if $d$ is not large then many matrices $R$ will lead to a biased rejection probability due to misspecification of the working model. This is exactly what we observe in Figures 1 and 2. For small values of $d$, the rejection probabilities $\bar{p}_{r}$ are systematically biased and we see some variability of their values due to the variation in the choice of $R_{r}$ (compare the benchmark panel in Figure 2). Both the bias and the variability in $\bar{p}_{r}$ decrease when $d$ increases, which is what we expected, as for large $d$, most $R_{r}$ will be favorable and we obtain small misspecification errors uniformly over these favorable $R_{r}$. What is remarkable is the systematic over-rejection in the case of the $t$- and exponential distributions and the under-rejection for Bernoulli and uniformly distributed designs. We currently cannot explain the mechanism that is responsible for this pattern. Finally, the benchmark panel shows i.i.d. samples $(\tilde{p}_{r})_{r=1}^{100}$ with $\tilde{p}_{r}\thicksim\text{Binomial}(1000,\alpha)/1000$ and success probabilities $\alpha=0.05,0.1,0.15,0.2$. This provides some idea of what portion of the variability observed in the other panels is due to random simulation error. Clearly, the results in the benchmark panel could have been equivalently obtained by repeating the previous simulation for the $F$-test with Gaussian design at significance levels $\alpha=0.05,0.1,0.15,0.2$.
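To give a feel for the size of pure simulation error, the benchmark samples described above can be reproduced with a few lines of standard-library Python. This is a sketch under our own choices of seed and variable names; it draws $\tilde{p}_{r}\thicksim\text{Binomial}(1000,\alpha)/1000$ for $r=1,\dots,100$ and reports the mean absolute deviation from $\alpha$, the same summary that Table 1 uses.

```python
import random

random.seed(0)               # illustrative seed
n_rep, n_mc = 100, 1000      # 100 replications, each based on 1000 Bernoulli trials

def benchmark_sample(alpha):
    """Draw p_tilde_r ~ Binomial(n_mc, alpha) / n_mc for r = 1, ..., n_rep."""
    return [sum(random.random() < alpha for _ in range(n_mc)) / n_mc
            for _ in range(n_rep)]

for alpha in (0.05, 0.1, 0.15, 0.2):
    p_tilde = benchmark_sample(alpha)
    # Mean absolute deviation from the nominal level, as in Table 1.
    d_bar = sum(abs(q - alpha) for q in p_tilde) / n_rep
    print(f"alpha={alpha:.2f}  mean abs deviation={d_bar:.4f}")
```

The standard deviation of a single $\tilde{p}_{r}$ is $\sqrt{\alpha(1-\alpha)/1000}$, roughly $0.007$ at $\alpha=0.05$, so deviations of that order in the other panels cannot be separated from Monte Carlo noise.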

Acknowledgments

The first author’s research was partially supported by FWF projects P 26354-N26 and P 28233-N32.

Appendix A Proofs

We begin with some preliminary considerations that connect this paper with the results of Steinberger and Leeb (2018b). In particular, we use Theorem 2.1, parts (ii) and (iii), in that reference with $Z=\tilde{z}$ and $\tau=1/2$ : If $f_{\tilde{z}}\in\mathcal{F}_{d,20}(D,E)$ , then the assumptions of that result are satisfied in view of Example 3.1 in Steinberger and Leeb (2018b). The theorem guarantees existence of a Borel subset $\mathbb{G}=\mathbb{G}(f_{\tilde{z}})\subseteq\mathcal{V}_{d,p}$ of the Stiefel manifold $\mathcal{V}_{d,p}$ of order $d\times p$ , that depends on the density $f_{\tilde{z}}$ , such that for all $t>0$ both

[TABLE]

and

[TABLE]

are bounded from above by

[TABLE]

such that

[TABLE]

where $\nu_{d,p}$ denotes the uniform distribution on the Stiefel manifold, and such that the set $\mathbb{G}$ is right-invariant under the action of $\mathcal{O}_{p}$, i.e., $\mathbb{G}R=\mathbb{G}$ whenever $R\in\mathcal{O}_{p}$. Here, the constant $\gamma=\gamma(D)$ depends only on $D$, and the constant $\kappa=\kappa(E)$ depends only on $E$.

For any full rank $d\times p$ matrix $M$ , any symmetric positive definite $d\times d$ matrix $\Sigma$ and $f_{\tilde{z}}\in\mathcal{F}_{d,20}(D,E)$ , we define the set

[TABLE]

Now take a random matrix $U$ that is uniformly distributed on $\mathcal{O}_{d}$ and another random matrix $V$ that is uniformly distributed on $\mathcal{O}_{p}$ , such that $U$ and $V$ are independent, and note that by right-invariance of $\mathbb{G}$ ,

[TABLE]

because $\Sigma^{1/2}M(M^{\prime}\Sigma M)^{-1/2}\in\mathcal{V}_{d,p}$ and $\nu_{d,p}$ is characterized by left and right invariance under the appropriate orthogonal groups. It follows that $\nu_{d}(\mathbb{U}^{c})$ is bounded by the expression on the right-hand side of (11) whenever $f_{\tilde{z}}\in\mathcal{F}_{d,20}(D,E)$ , which establishes the first claim of Theorem 4.1. The proof of the second claim is more elaborate.

The results in the preceding paragraph also show that the error $e$ in the working model (4) is such that ${\mathbb{E}}[e\|x]$ is approximately zero and $\operatorname{Var}[e\|x]$ is approximately constant, provided that $R\in\mathbb{U}$ : We first re-write the error $e$ in a convenient form. Set $\tilde{\theta}=R^{\prime}\Sigma^{1/2}\theta$ and $\tilde{M}=R^{\prime}\Sigma^{1/2}M$ . Then it is easy to see that $e=\tilde{\theta}^{\prime}(I_{d}-P_{\tilde{M}})\tilde{z}+\epsilon$ and hence

[TABLE]

see also (4)–(5). Our goal is to show that the expressions in the preceding two displays are approximately zero. To this end, we focus on the expressions in curly brackets and use Cauchy-Schwarz: For each $t>0$ , we have

[TABLE]

Now if $R\in\mathbb{U}(M,\Sigma,f_{\tilde{z}})$ , then it is easy to see that $\tilde{M}(\tilde{M}^{\prime}\tilde{M})^{-1/2}\in\mathbb{G}(f_{\tilde{z}})$ . Because conditioning on $P_{\tilde{M}}\tilde{z}$ is equivalent to conditioning on $(\tilde{M}^{\prime}\tilde{M})^{-1/2}\tilde{M}^{\prime}\tilde{z}$ , it follows that ${\mathbb{P}}(|{\mathbb{E}}[e\|x]|>t)$ is bounded from above by (9) with $t$ replaced by $t/\|(I_{d}-P_{\tilde{M}})\tilde{\theta}\|$ and that ${\mathbb{P}}(|{\mathbb{E}}[e^{2}\|x]-s^{2}|>t)$ is bounded by (9) with $t$ replaced by $t/\|(I_{d}-P_{\tilde{M}})\tilde{\theta}\|^{2}$ .

The consideration in the preceding paragraph suggests that the effect of misspecification in (4), where ${\mathbb{E}}[e\|x]$ may be non-zero and $\operatorname{Var}[e\|x]$ may be non-constant, may be negligible in an asymptotic setting where $p/\log d$ becomes small, provided that $f_{\tilde{z}}\in\mathcal{F}_{d,20}$ and that $R\in\mathbb{U}(M,\Sigma,f_{\tilde{z}})$ . This idea is formalized in the following two results, which show that the distribution of certain statistics is unaffected asymptotically if the error $e$ is replaced by a substitute error $e^{\ast}$ that has mean zero and constant variance conditional on $x$ . The following results are stated for sequences where the data-generating model (1)-(2) and the working model (4) are allowed to depend on $n$ , that is, a ‘triangular array’ setting where all parameters depend on $n$ .

Lemma A.1.

Fix finite positive constants $D$ and $E$ . For every $n\in{\mathbb{N}}$ , let $p_{n}\leq d_{n}$ be positive integers so that $np_{n}/\log d_{n}\to 0$ as $n\to\infty$ . For each $n$ , consider $(y,z,x)$ as in (1)–(3) but with $d_{n}$ and $p_{n}$ replacing $d$ and $p$ , respectively, with $f_{\tilde{z}}\in\mathcal{F}_{d_{n},20}(D,E)$ and with $R\in\mathbb{U}(M,\Sigma,f_{\tilde{z}})$ . And for each $n$ , consider a sample of $n$ i.i.d. observations $(y_{i},z_{i},x_{i})$ , $1\leq i\leq n$ , of $(y,z,x)$ , stack the values of the individual variables into a vector $Y$ and matrices $Z$ and $X$ , respectively, and write $E=Y-\alpha\iota-X\beta=(e_{1},\dots,e_{n})^{\prime}$ for the vector of errors from (4). Finally, define a vector $E^{\ast}=(e^{\ast}_{1},\dots,e^{\ast}_{n})^{\prime}$ of substitute errors through $e_{i}^{\ast}=s(\operatorname{Var}[e_{i}\|x_{i}])^{-1/2}(e_{i}-{\mathbb{E}}[e_{i}\|x_{i}])$ . Then, for every $k\in{\mathbb{R}}$ and (possibly random) symmetric idempotent $n\times n$ matrices $P_{n}$ ,

[TABLE]

as $n\to\infty$. As a by-product, we also obtain that

[TABLE]

Proof.

First, note that $\operatorname{Var}[e_{i}\|x_{i}]=\operatorname{Var}[y_{i}\|x_{i}]=\operatorname{Var}[\theta^{\prime}z_{i}\|x_{i}]+\sigma^{2}>0$ , so that $e_{i}^{*}$ is well defined (almost surely). For the claim in (13), fix $k\in{\mathbb{R}}$ and $t>0$ , and consider ${\mathbb{P}}(n^{k}\|E-E^{*}\|/s>t)\leq n{\mathbb{P}}(n^{2k+1}|e_{1}-e_{1}^{*}|^{2}/s^{2}>t^{2})$ . Now, using the simple observation $|\sqrt{\operatorname{Var}[e_{1}\|x_{1}]}-s|=|\operatorname{Var}[e_{1}\|x_{1}]-s^{2}|/|\sqrt{\operatorname{Var}[e_{1}\|x_{1}]}+s|\leq|\operatorname{Var}[e_{1}\|x_{1}]-s^{2}|/s$ , we get

[TABLE]

and furthermore

[TABLE]

The claim (13) will follow if each of the four terms in the preceding display is of the order $o(1/n)$. Because $f_{\tilde{z}}\in\mathcal{F}_{d_{n},20}(D,E)$ and $R\in\mathbb{U}(M,\Sigma,f_{\tilde{z}})$, the considerations leading up to Lemma A.1 apply. Also note that $\|(I_{d}-P_{\tilde{M}})\tilde{\theta}\|^{2}\leq s^{2}$. For the last of these four terms, we obtain, for every $t>0$, that

[TABLE]

and the upper bound goes to zero as $o(1/n)$ in view of the assumption that $np_{n}/\log d_{n}\to 0$. For the second-to-last of the four terms, we have ${\mathbb{P}}(|e_{1}|/s>n)\leq n^{-2}{\mathbb{E}}[e_{1}^{2}/s^{2}]=1/n^{2}$. For the second of the four terms, we proceed as for the last one. In particular, we obtain, for any $t>0$, that

[TABLE]

Again, this upper bound goes to zero as $o(1/n)$ because $np_{n}/\log d_{n}\to 0$ . Note that the considerations in the preceding display also entail that ${\mathbb{P}}(\max_{i=1,\dots,n}|\operatorname{Var}[e_{i}\|x_{i}]/s^{2}-1|>t)\leq n{\mathbb{P}}(|\operatorname{Var}[e_{1}\|x_{1}]/s^{2}-1|>t)\to 0$ .

For the claim in (14), write

[TABLE]

and note that by definition of $e_{1}^{*}$ and the variance decomposition formula, we have ${\mathbb{E}}[e_{1}^{*}]={\mathbb{E}}[{\mathbb{E}}[e_{1}^{*}\|x_{1}]]=0$ and $\operatorname{Var}[e_{1}^{*}]={\mathbb{E}}[\operatorname{Var}[e_{1}^{*}\|x_{1}]]+\operatorname{Var}[{\mathbb{E}}[e_{1}^{*}\|x_{1}]]=s^{2}$ , so that by independence $\|E^{*}\|/s=O_{\mathbb{P}}(\sqrt{n})$ . Premultiplying by $n^{k}/s^{2}$ in the previous display and applying (13) finishes the proof of the second claim. ∎
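For readers who want the algebra behind this last step spelled out, here is one way to pass from a bound on $\|E-E^{*}\|$ to a bound on differences of quadratic forms. This is our own sketch under the assumptions of the lemma, using only that a symmetric idempotent $P_{n}$ has operator norm at most one; it is not necessarily the exact display used in the proof.

```latex
\begin{align*}
E'P_{n}E - E^{*\prime}P_{n}E^{*}
  &= (E-E^{*})'P_{n}(E-E^{*}) + 2\,(E-E^{*})'P_{n}E^{*}, \\
\frac{n^{k}}{s^{2}}\,\bigl|E'P_{n}E - E^{*\prime}P_{n}E^{*}\bigr|
  &\le \left(\frac{n^{k/2}\|E-E^{*}\|}{s}\right)^{2}
     + 2\,\frac{n^{k+1/2}\|E-E^{*}\|}{s}\cdot\frac{\|E^{*}\|}{s\sqrt{n}} .
\end{align*}
```

Since $n^{j}\|E-E^{*}\|/s\to 0$ in probability for every fixed $j\in\mathbb{R}$ and $\|E^{*}\|/s=O_{\mathbb{P}}(\sqrt{n})$, both terms on the right-hand side vanish in probability.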

Lemma A.2.

Fix $K\in(0,\infty)$ and an integer $l\geq-1$ . Under the assumptions and in the notation of Lemma A.1, assume that ${\mathbb{E}}[|\epsilon/\sigma|^{4}]\leq K$ for each $n$ , that $\Delta=\operatorname{Var}(\beta^{\prime}x)/\operatorname{Var}(e)=O(n^{l})$ and that $\limsup_{n\to\infty}p_{n}/n<1$ . Define substitute data $Y^{\ast}=\iota\alpha+X\beta+E^{\ast}$ . Then, for every $k\in\mathbb{R}$ , we have

[TABLE]

as $n\to\infty$ .

Proof.

The idea is to use Lemma A.1 to approximate $\hat{F}_{n}(X,Y)$ by $\hat{F}_{n}(X,Y^{*})$ . In particular, we will show that on some event $C_{n}$ to be defined below, we have

$$\hat{F}_{n}(X,Y)=\delta_{n}^{(1)}\hat{F}_{n}(X,Y^{*})+\delta_{n}^{(2)},$$

where $\delta_{n}^{(1)}$ converges to one and $\delta_{n}^{(2)}$ converges to zero, both at an arbitrary polynomial rate in $n$ , and where $\hat{F}_{n}(X,Y^{*})/n^{l+1}=O_{\mathbb{P}}(1)$ . The probability of $C_{n}$ will be shown to converge to one. The claim of the lemma follows from this.

Set $U=[\iota,X]$ , where $\iota=(1,\dots,1)^{\prime}\in{\mathbb{R}}^{n}$ . With this, define the event $C_{n}=\{\det{U^{\prime}U}\neq 0,E^{\prime}(I_{n}-P_{U})E>0,{E^{*}}^{\prime}(I_{n}-P_{U})E^{*}>0\}$ . On $C_{n}$ , by block matrix inversion, we have $[0,I_{p_{n}}](U^{\prime}U)^{-1}U^{\prime}=[X^{\prime}(I_{n}-P_{\iota})X]^{-1}X^{\prime}(I_{n}-P_{\iota})$ . Using the abbreviation $V=(I_{n}-P_{\iota})X$ , we thus see that $\hat{\beta}=\beta+(V^{\prime}V)^{-1}V^{\prime}E$ and that the $F$ -statistic $\hat{F}_{n}(X,Y)$ can be written as

[TABLE]

This establishes a representation $\hat{F}_{n}(X,Y)=\delta_{n}^{(1)}\hat{F}_{n}(X,Y^{*})+\delta_{n}^{(2)}$ on $C_{n}$ . On the complement of $C_{n}$ , we set $\delta_{n}^{(1)}=\delta_{n}^{(2)}=0$ , say. We next show that for every fixed $k\in{\mathbb{R}}$ , $n^{k}(\delta_{n}^{(1)}-1)=o_{\mathbb{P}}(1)$ and $n^{k}\delta_{n}^{(2)}=o_{\mathbb{P}}(1)$ .
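The block-inversion identity $[0,I_{p_{n}}](U^{\prime}U)^{-1}U^{\prime}=[X^{\prime}(I_{n}-P_{\iota})X]^{-1}X^{\prime}(I_{n}-P_{\iota})$ and the resulting formula $\hat{\beta}=\beta+(V^{\prime}V)^{-1}V^{\prime}E$ used above can be verified numerically. The sketch below does so on arbitrary simulated data; the sizes, the seed and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)          # illustrative seed
n, p = 20, 4                            # illustrative sizes
X = rng.standard_normal((n, p))
U = np.hstack([np.ones((n, 1)), X])     # U = [iota, X]

# Left-hand side: the last p rows of (U'U)^{-1} U'.
lhs = np.linalg.solve(U.T @ U, U.T)[1:, :]

# Right-hand side: (V'V)^{-1} V' with V = (I_n - P_iota) X, i.e. column-centered X.
V = X - X.mean(axis=0)
rhs = np.linalg.solve(V.T @ V, V.T)
max_gap = np.max(np.abs(lhs - rhs))

# Consequence: the slope estimate equals beta + (V'V)^{-1} V' E.
alpha, beta = 1.5, rng.standard_normal(p)
E = rng.standard_normal(n)
Y = alpha + X @ beta + E
beta_hat = lhs @ Y
gap_beta = np.max(np.abs(beta_hat - (beta + rhs @ E)))
print(max_gap, gap_beta)                # both should be at rounding-error level
```

The second check works because $V^{\prime}\iota=0$ and $V^{\prime}X=V^{\prime}V$, so the intercept drops out and the coefficient of $\beta$ is the identity.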

To verify the claimed properties of these quantities, on $C_{n}$ , consider first

[TABLE]

Using Lemma A.1, we see that the first fraction in this representation multiplied by $n^{k}$ converges to zero in probability. The second fraction obviously equals $s^{2}/\hat{s}^{2}$ . Define $\hat{s}^{*2}$ like $\hat{s}^{2}$ (see the discussion following (6)) but with $Y^{*}$ replacing $Y$ . We show that $\hat{s}^{2}/s^{2}=\hat{s}^{*2}/s^{2}+(\hat{s}^{2}-\hat{s}^{*2})/s^{2}\to 1$ in probability. To see this, first note that the convergence to zero of $(\hat{s}^{2}-\hat{s}^{*2})/s^{2}$ follows again from Lemma A.1. For the ratio $\hat{s}^{*2}/s^{2}$ , convergence to $1$ in probability follows, e.g., from Lemma C.1 in Steinberger (2016), upon verifying its assumptions. To this end, it remains to show that $n^{-1}\sum_{i=1}^{n}{\mathbb{E}}[(e^{*}_{i}/s)^{4}\|x_{i}]=O_{\mathbb{P}}(1)$ . Using $(a+b)^{4}\leq 2^{3}(a^{4}+b^{4})$ , for $a,b\in{\mathbb{R}}$ , we have

[TABLE]

The maximum in the preceding display converges to one in probability if $\min_{j}\operatorname{Var}[e_{j}/s\|x_{j}]$ converges to one in probability, which follows from Lemma A.1. The arithmetic mean of the conditional fourth moments is $O_{\mathbb{P}}(1)$ if the unconditional mean of the fourth moments is bounded in $n$. To this end, note that we have $e=\tilde{\theta}^{\prime}(I_{d}-P_{\tilde{M}})\tilde{z}+\epsilon$ and $s^{2}=\|(I_{d}-P_{\tilde{M}})\tilde{\theta}\|^{2}+\sigma^{2}$; cf. (5) and the discussion right before (12). With this, we get

[TABLE]

and take expectations. The claim follows now from ${\mathbb{E}}[(\epsilon_{i}/\sigma)^{4}]\leq K$ and the fact that the fourth spherical moment of $\tilde{z}_{i}$ is uniformly bounded in view of Rosenthal’s inequality (Rosenthal, 1970, Theorem 3) and the assumption that $f_{\tilde{z}}\in\mathcal{F}_{d_{n},20}(D,E)$. Note that this also entails ${\mathbb{P}}(C_{n}^{c})\leq{\mathbb{P}}(\hat{s}^{*2}=0)+{\mathbb{P}}(\hat{s}^{2}=0)\leq{\mathbb{P}}(|\hat{s}^{*2}/s^{2}-1|>1/2)+{\mathbb{P}}(|\hat{s}^{2}/s^{2}-1|>1/2)\to 0$.

To see that also $\delta_{n}^{(2)}$ behaves as desired, first note that on $C_{n}$ ,

[TABLE]

The factor $n^{k}/p_{n}$ can be bounded by $\kappa n^{k-1}$ for some constant $\kappa$ by assumption; the ratio $s^{2}/\hat{s}^{2}$ was shown to converge to one in probability in the preceding paragraph. The difference of quadratic forms converges to zero in probability by Lemma A.1, even when multiplied by $\kappa n^{k-1}$ . Noting that $\|V\beta\|=\|(I_{n}-P_{\iota})X\beta\|\leq\|(I_{n}-P_{\iota})X(\tilde{M}^{\prime}\tilde{M})^{-1/2}\|\|(\tilde{M}^{\prime}\tilde{M})^{1/2}\beta\|$ , the scaled second term in parentheses, i.e., $(n^{k}/p_{n})2(E-E^{*})^{\prime}V\beta/s^{2}$ , can be bounded by

[TABLE]

where $n^{k+l/2}\|E-E^{*}\|/s$ converges to zero in probability by Lemma A.1 and $n^{-l}\beta^{\prime}(\tilde{M}^{\prime}\tilde{M})\beta/s^{2}=n^{-l}\Delta=O(1)$ by assumption. It remains to show that the largest singular value of $(I_{n}-P_{\iota})X(\tilde{M}^{\prime}\tilde{M})^{-1/2}/n$ is bounded in probability. Due to the projection onto the orthogonal complement of $\iota$ , the distribution of this quantity does not depend on the parameter $\mu$ , which is why we may assume that $\mu=0$ for this part of the argument. Abbreviate $\bar{X}=X(\tilde{M}^{\prime}\tilde{M})^{-1/2}$ , $\bar{x}_{i}=(\tilde{M}^{\prime}\tilde{M})^{-1/2}x_{i}$ and consider $\|(I_{n}-P_{\iota})\bar{X}/n\|^{2}\leq\operatorname{trace}(\bar{X}^{\prime}\bar{X}/n^{2})=\sum_{i=1}^{n}\|\bar{x}_{i}\|^{2}/n^{2}$ . Taking expectation, noting that ${\mathbb{E}}[\|\bar{x}_{1}\|^{2}]=p_{n}$ and $p_{n}/n=O(1)$ , we arrive at the desired boundedness in probability.

It remains to show that $\hat{F}_{n}(X,Y^{*})/n^{l+1}=O_{\mathbb{P}}(1)$ . To this end, recall that $\hat{s}^{*2}/s^{2}\to 1$ in probability, and one easily verifies that

[TABLE]

here, the first equality is obtained by arguing as in the first paragraph of the proof but with $Y^{\ast}$ replacing $Y$, and the second equality follows upon noting that $\beta^{\prime}V^{\prime}V\beta=\operatorname{trace}[(I_{n}-P_{\iota})X\beta\beta^{\prime}X^{\prime}]$ and that $X\beta$ is a vector with i.i.d. components, each of which has variance $\beta^{\prime}M^{\prime}\Sigma M\beta=s^{2}\Delta$. ∎

Proof of Theorem 4.1.

Define $\mathbb{U}=\mathbb{U}(M,\Sigma,f_{\tilde{z}})$ as in the beginning of the appendix and note that the first statement in the theorem, concerning $\nu_{d}(\mathbb{U})$ , has already been established there. For the second statement, concerning $\Xi_{n}$ , let $p_{n}\leq d_{n}$ be positive integers so that $n^{2}p_{n}/\log d_{n}\to 0$ and so that $p_{n}/n\to\rho\in(0,1)$ as $n\to\infty$ . For each $n$ , consider a sample of i.i.d. observations $(y_{i},z_{i},x_{i})$ , $1\leq i\leq n$ , as in Lemma A.1, so that the underlying quantities (i.e., $M$ , $\vartheta$ , $\theta$ , $\mathcal{L}(\epsilon)$ , $\mu$ , $\Sigma$ , $\Delta$ , $f_{\tilde{z}}$ , and $R$ ) satisfy the restrictions in the suprema in the last display of Theorem 4.1. For given $M$ , we stress that the restriction on $\Delta$ implicitly also restricts the parameters $\theta$ , $\Sigma$ and $\sigma^{2}$ ; see the definition of $\Delta$ at the beginning of Section 4 as well as the relations in (5). We have to show that $\Xi_{n}\to 0$ as $n\to\infty$ .

Set $a_{n}=2(1/p_{n}+1/(n-p_{n}-1))$ and $b_{n}=\sqrt{\frac{(1-(p_{n}+1)/n)(1-1/n)}{2p_{n}/n}}$ for each $n$ , and define $Y^{\ast}$ for each $n$ as in Lemma A.2. We first show that

[TABLE]

by verifying the assumptions of Theorem 2.1(i) in Steinberger (2016) for the sample $(y_{i}^{*},x_{i})_{i=1}^{n}$, with the symbols $s_{n}$, $\Delta_{\gamma}$ and $R_{0}$ in that reference equal to $a_{n}$, $\Delta$, and $[0,I_{p_{n}}]$, respectively. In particular, we need to verify conditions (A1).(a,b,c,d) and (A2) in that reference. The design conditions (A1).(a,c,d) are easily verified by use of Lemma A.2(i) in Steinberger (2016). And our assumptions that $f_{\tilde{z}}\in\mathcal{F}_{d_{n},20}(D,E)$ and that $p_{n}<n-1$ imply condition (A1).(b). Assumption (A2) on the scaled errors $e_{i}^{*}/s$ is established by an argument similar to the one used in the third paragraph of the proof of Lemma A.2, but for the $(8+\kappa)$-th moment instead of the fourth moment: Simply decompose $e_{i}^{*}=e^{\circ}_{i}\tilde{{\varepsilon}}_{i}$, with $e^{\circ}_{i}=\sqrt{s^{2}/\operatorname{Var}[e_{i}\|x_{i}]}$ and $\tilde{{\varepsilon}}_{i}=e_{i}-{\mathbb{E}}[e_{i}\|x_{i}]$, and use Lemma A.1 as before to get $\max_{i=1,\dots,n}e^{\circ}_{i}\to 1$ in probability. Then, the assumption that ${\mathbb{E}}[|\epsilon/\sigma|^{8+\kappa}]\leq K$ and the fact that the marginals of $\tilde{z}$ have bounded 20th moment (because $f_{\tilde{z}}\in\mathcal{F}_{d_{n},20}(D,E)$), together with Rosenthal’s inequality, establish the boundedness of ${\mathbb{E}}[|\tilde{{\varepsilon}}_{i}/s|^{8+\kappa}]$, which is sufficient for (A2). Using Lemma A.2 and noting that $a_{n}^{-1/2}=n^{k}(1+o(1))$ for some $k\in\mathbb{R}$, it follows that (18) continues to hold with $\hat{F}_{n}(X,Y)$ replacing $\hat{F}_{n}(X,Y^{*})$.
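The growth of the normalizing constants can be made concrete with a short computation. The sketch below evaluates $a_{n}$ and $b_{n}$ as defined at the start of this part of the proof along a sequence with $p_{n}/n\to\rho$; the value $\rho=0.3$ and the function names are our own illustrative choices. From the definition, $a_{n}=2(n-1)/(p_{n}(n-p_{n}-1))$, so $a_{n}^{-1/2}/\sqrt{n}$ should approach $\sqrt{\rho(1-\rho)/2}$, i.e., $a_{n}^{-1/2}$ is of polynomial order $n^{1/2}$.

```python
import math

def a_n(n, p):
    """a_n = 2 * (1/p + 1/(n - p - 1)), as defined in the proof."""
    return 2.0 * (1.0 / p + 1.0 / (n - p - 1))

def b_n(n, p):
    """b_n = sqrt((1 - (p + 1)/n) * (1 - 1/n) / (2 p / n)), as defined in the proof."""
    return math.sqrt((1 - (p + 1) / n) * (1 - 1 / n) / (2 * p / n))

rho = 0.3                                   # illustrative limit of p_n / n
limit = math.sqrt(rho * (1 - rho) / 2)      # limit of a_n^{-1/2} / sqrt(n)
for n in (100, 1_000, 10_000, 100_000):
    p = round(rho * n)
    ratio = a_n(n, p) ** -0.5 / math.sqrt(n)
    print(n, p, round(ratio, 5), round(limit, 5), round(b_n(n, p), 5))
```

In the same regime $b_{n}$ stays bounded and tends to $\sqrt{(1-\rho)/(2\rho)}$, which follows directly from its definition.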

Now standard arguments conclude the proof: First, note that an appropriately scaled and centered $F$ -distributed random variable $\mathcal{F}_{p_{n},n-p_{n}-1,n\Delta}$ with $p_{n}$ and $n-p_{n}-1$ degrees of freedom and non-centrality parameter $n\Delta$ is also asymptotically normal, i.e.,

[TABLE]

because $p_{n}/n\to\rho\in(0,1)$ implies that $p_{n}\to\infty$ . Hence, we have

[TABLE]

and the last two suprema converge to zero in view of Pólya’s theorem, which establishes $\Xi_{n}\to 0$ in case $\Xi_{n}$ equals (7). Finally, it is elementary to verify that $\Xi_{n}$ also converges to zero in case $\Xi_{n}$ equals (8): This follows from (19) with $\hat{F}_{n}(X,Y)$ replacing $\hat{F}_{n}(X,Y^{*})$, because the quantiles of the central $F$-distribution satisfy $a_{n}^{-1/2}(F^{-1}_{p_{n},n-p_{n}-1,0}(\alpha)-1)\to\Phi^{-1}(\alpha)$. ∎
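The normal approximation to the $F$-distribution that the proof relies on can be illustrated by simulation in the central case $\Delta=0$. The sketch below is an assumption-laden illustration rather than part of the argument: it centers the $F$-variable at one and scales by $a_{n}^{-1/2}$, uses arbitrary values $n=2000$, $p_{n}=1000$, and generates chi-square variables from the standard-library gamma sampler.

```python
import math
import random
import statistics

random.seed(0)                           # illustrative seed
n, p = 2000, 1000                        # illustrative sizes with p/n = 1/2
m = n - p - 1                            # denominator degrees of freedom
a = 2.0 * (1.0 / p + 1.0 / m)            # a_n as defined in the proof

def chi2(k):
    """Chi-square draw with k degrees of freedom, i.e. Gamma(shape k/2, scale 2)."""
    return random.gammavariate(k / 2.0, 2.0)

def central_f():
    """Central F draw with p and m degrees of freedom."""
    return (chi2(p) / p) / (chi2(m) / m)

# Standardize: subtract 1 and divide by sqrt(a_n).
z = [(central_f() - 1.0) / math.sqrt(a) for _ in range(4000)]
z_mean = statistics.mean(z)
z_var = statistics.variance(z)
print(round(z_mean, 3), round(z_var, 3))  # should be near 0 and 1, respectively
```

The exact mean and variance of the central $F$-distribution give a standardized mean of about $0.03$ and a standardized variance of about $1.007$ for these sizes, so small departures from $0$ and $1$ are expected even without Monte Carlo error.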

Remark A.3.

Inspection of the proof reveals that the assumption that $\sigma^{2}$ is positive is used only to guarantee that $\operatorname{Var}[e\|x]>0$ almost surely (and hence also $s^{2}=\operatorname{Var}[e]>0$ ). If this assumption is dropped, we thus see that $\Xi_{n}$ (defined in Theorem 4.1) converges to zero along sequences of parameters as used in the proof of Theorem 4.1, provided that $\operatorname{Var}[\theta^{\prime}z\|x]>0$ almost surely for each $n$ (as then $\operatorname{Var}[e\|x]=\operatorname{Var}[y\|x]>0$ a.s.).

Bibliography

Abadie et al. (2014). A. Abadie, G. W. Imbens, and F. Zheng. Inference for misspecified models with fixed regressors. J. Amer. Statist. Assoc., 109:1601–1614, 2014.
Akritas and Arnold (2000). M. Akritas and S. Arnold. Asymptotics for analysis of variance when the number of levels is large. J. Amer. Statist. Assoc., 95:212–226, 2000.
Anderson (1958). T. W. Anderson. An introduction to multivariate analysis. Wiley, New York, NY, 1958.
Bachoc et al. (2015). F. Bachoc, H. Leeb, and B. M. Pötscher. Valid confidence intervals for post-model-selection prediction. arXiv:1412.4605, 2015.
Bai and Saranadasa (1996). Z. Bai and H. Saranadasa. Effect of high dimension: By an example of a two sample problem. Stat. Sinica, 6:311–329, 1996.
Baik and Silverstein (2006). J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivar. Anal., 97:1382–1408, 2006.
Bathke and Lankowski (2005). A. Bathke and D. Lankowski. Rank procedures for a large number of treatments. J. Statist. Plann. Inference, 133:223–238, 2005.
Boos and Brownie (1995). D. D. Boos and C. Brownie. ANOVA and rank tests when the number of treatments is large. Statist. Probab. Lett., 23:183–191, 1995.
