Optimal Sparsity Testing in Linear regression Model

Alexandra Carpentier; Nicolas Verzelen

arXiv:1901.08802·math.ST·April 24, 2020

TL;DR

This paper investigates the fundamental limits of testing the sparsity level in high-dimensional linear regression, providing minimax separation distances for different scenarios with known and unknown parameters.

Contribution

It precisely characterizes the minimax separation distances for sparsity testing in high-dimensional linear regression under various conditions, highlighting the influence of null and alternative sparsity levels.

Findings

01

Minimax separation distances depend on null and alternative sparsity levels.

02

Different scenarios show distinct separation distances based on knowledge of covariance and noise.

03

Both null and alternative hypotheses' sizes are crucial in the testing problem.

Tables

Table 1: Square minimax separation distances $\rho^{*2}_{\gamma}[k_{0},\Delta]$ in the independent setting for $k_{0}\in[1,p-1]$ and $\Delta\in[1,p-k_{0}]$ when $p\geq n^{1+\zeta}$ with a fixed $\zeta>0$ . Separation distances are given up to constants that may depend on $\gamma$ and $\zeta$ .

$k_{0}$	$\Delta$	$\rho_{\gamma}^{*2}[k_{0},\Delta]$
$k_{0}\leq p^{1/2-\zeta}$	$1\leq\Delta\leq k_{0}+\frac{\sqrt{n}}{\log(p)}$	$\frac{\Delta\log(p)}{n}$
	$k_{0}+\frac{\sqrt{n}}{\log(p)}\leq\Delta\leq p-k_{0}$	$\frac{1}{\sqrt{n}}+\frac{k_{0}\log(p)}{n}$
$p^{1/2+\zeta}\leq k_{0}\leq c_{\gamma}\frac{n}{\log(p)}$	$1\leq\Delta\leq k_{0}p^{-\zeta}$	$\frac{\Delta\log(p)}{n}$
	$k_{0}\leq\Delta\leq p-k_{0}$	$\frac{k_{0}}{n\log(p)}$
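The first two rows of Table 1 (small $k_{0}$) can be read as a piecewise formula in $\Delta$. The following is a minimal sketch of that reading, with all multiplicative constants dropped; the function name and the example values of $n$, $p$, $k_{0}$ are illustrative assumptions, not part of the paper.

```python
import math

def squared_rate_small_k0(k0: int, delta: int, n: int, p: int) -> float:
    """Order of magnitude of the squared separation distance for small k0.

    Sketch of the first two rows of Table 1 (independent setting).
    Constants depending on gamma and zeta are dropped, so only the
    shape of the rate in delta is meaningful.
    """
    log_p = math.log(p)
    if delta <= k0 + math.sqrt(n) / log_p:
        # Sparse alternatives: the rate grows linearly in delta.
        return delta * log_p / n
    # Denser alternatives: the rate no longer depends on delta.
    return 1.0 / math.sqrt(n) + k0 * log_p / n

if __name__ == "__main__":
    n, p, k0 = 10_000, 10**6, 5  # hypothetical values with p >= n^(1+zeta)
    for delta in (1, 5, 12, 13, 100):
        print(delta, squared_rate_small_k0(k0, delta, n, p))
```

At the transition $\Delta=k_{0}+\sqrt{n}/\log(p)$ the two branches coincide, so the rate is continuous and non-decreasing in $\Delta$.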

Table 2: Square minimax separation distances in the general setting (in the $\asymp_{\gamma,\eta}$ sense, see Subsection 1.5). We report in this table only the case where $n^{1+\zeta}\leq p\leq n^{2-\zeta}$ , where $\zeta\in(0,1)$ can be chosen arbitrarily small. LB stands for lower bound and UB stands for upper bound.

$k_{0}$	$\Delta$	$\boldsymbol{\rho}_{g,\gamma}^{*2}[k_{0},\Delta]$
$k_{0}\leq p^{1/2-\zeta}$	$1\leq\Delta\leq p^{1/2-\zeta}\wedge k_{0}$	$\frac{\Delta\log(p)}{n}$
	$p^{1/2+\zeta}\wedge k_{0}\leq\Delta\leq p-k_{0}$	$\frac{\sqrt{p}}{n}$
$p^{1/2+\zeta}\leq k_{0}\leq c_{\gamma}\frac{n}{\log(p)}$	$1\leq\Delta\leq k_{0}p^{-\zeta}$	$\frac{\Delta\log(p)}{n}$
	$k_{0}\leq\Delta\leq p-k_{0}$	LB : $\frac{k_{0}}{n\log(p)}$
		UB : $\frac{k_{0}\log(p)}{n}$

Equations

$$Y=\mathbf{X}\theta^{*}+\sigma\epsilon\ ,$$

$$R(\phi;k_{0},\Delta,\rho):=\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}]}\operatorname{\mathbb{P}}_{\theta^{*},\mathbf{I}_{p},\sigma}[\phi=1]+\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}+\Delta],\ d_{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\geq\rho\sigma}\operatorname{\mathbb{P}}_{\theta^{*},\mathbf{I}_{p},\sigma}[\phi=0]\ ,$$

$$\rho_{\gamma}^{*}[k_{0},\Delta]:=\inf_{\phi}\rho_{\gamma}(\phi;k_{0},\Delta)\ ,$$

$$\mathcal{U}(\eta)=\{{\boldsymbol{\Sigma}}:\ \eta^{-1}\leq\eta_{\min}({\boldsymbol{\Sigma}})\leq\eta_{\max}({\boldsymbol{\Sigma}})\leq\eta\}\ .$$

$$\mathbf{R}_{g}(\phi;k_{0},\Delta,\rho):=\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}],\ {\boldsymbol{\Sigma}}\in\mathcal{U}(\eta),\ \sigma>0}\operatorname{\mathbb{P}}_{\theta^{*},{\boldsymbol{\Sigma}},\sigma}[\phi=1]+\sup_{\sigma>0,\ \theta^{*}\in\mathbb{B}_{0}[k_{0}+\Delta],\ d_{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\geq\sigma\rho,\ {\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)}\operatorname{\mathbb{P}}_{\theta^{*},{\boldsymbol{\Sigma}},\sigma}[\phi=0]\ .$$

$$\boldsymbol{\rho}_{g,\gamma}^{*}[k_{0},\Delta]:=\inf_{\phi}\boldsymbol{\rho}_{g,\gamma}(\phi;k_{0},\Delta)\ .$$

$$\rho_{\gamma}^{*2}[0,\Delta]\asymp_{\gamma,\zeta}\min\left[\frac{\Delta\log(p)}{n},n^{-1/2}\right]\ ,$$

$\boldsymbol{\rho}_{g,\gamma}^{*2}[0,\Delta]$

$$\rho_{\gamma}^{*2}[k_{0},\Delta]\leq c_{\gamma}\Big[n^{-1/2}+\frac{k_{0}\log(p)}{n}\Big]\ ,$$

$$\rho_{\gamma}^{*2}[k_{0},\Delta]\geq c_{1}\left\{\begin{array}{ccc}\min\Big[\frac{1}{\sqrt{n}}+\frac{k_{0}}{n}\log\big[1+\frac{\sqrt{p}}{k_{0}}\big],\frac{\Delta}{n}\log\big(1+\frac{\sqrt{p}}{\Delta}\big)\Big]&\text{ if }&0\leq k_{0}\leq\sqrt{p}\wedge n\ ;\\ \frac{\Delta}{n}\frac{\log^{2}\big[1+\sqrt{\frac{k_{0}}{\Delta}}\big]}{\log(p)}&\text{ if }&\sqrt{p}<k_{0}<n\ .\end{array}\right.$$

$$\rho_{\gamma}^{*2}[k_{0},\Delta]\geq c_{4}\frac{\Delta\wedge k_{0}}{n}\log\Big(2\vee\frac{p}{k_{0}}\Big)e^{c_{5}\frac{k_{0}}{n}\log\left(1+\frac{p}{k_{0}}\right)}\ ,$$

$$\mathbf{T}^{(1)}=\Big(\mathbf{X}^{(1)}_{.,1}/\|\mathbf{X}^{(1)}_{.,1}\|_{2},\ldots,\mathbf{X}^{(1)}_{.,p}/\|\mathbf{X}^{(1)}_{.,p}\|_{2}\Big)\ .$$

$$\widehat{\theta}_{SL,N}\in\arg\min\|Y^{(1)}-\mathbf{T}^{(1)}\theta\|_{2}+\lambda\|\theta\|_{1}\ ;\quad(\widehat{\theta}_{SL})_{i}=(\widehat{\theta}_{SL,N})_{i}/\|\mathbf{X}^{(1)}_{.,i}\|_{2}\ ,\quad i=1,\ldots,p\ .$$

$$\widetilde{\theta}_{\mathbf{I}}=\frac{1}{m}\mathbf{X}^{(2)T}\big(Y^{(2)}-\mathbf{X}^{(2)}\widehat{\theta}_{SL}\big)+\widehat{\theta}_{SL}\ .$$

$$\mathbb{B}_{0}[k_{0}+\Delta]\bigcap\Big\{\theta^{*},\quad|\theta^{*}_{(k_{0}+1)}|\geq c\sigma\sqrt{\frac{1}{n}\log\big(\frac{p}{\alpha\wedge\beta}\big)}\Big\}\ ,$$

$$\Big\{\theta^{*},\quad d^{2}_{2}\big[\theta^{*};\mathbb{B}_{0}[k_{0}]\big]\geq c\sigma^{2}\Big[\frac{k_{0}\vee 1}{n}\log(p/\delta)+\sqrt{\frac{\log(2/(\alpha\wedge\beta))}{n}}\Big]\Big\}\ .$$

$$\varphi(s;x)=\int_{-1}^{+1}(1-|\xi|)\cos\big(\xi sx\big)e^{\xi^{2}s^{2}/2}d\xi\ .$$

$$\overline{\theta}_{I,i}=\widetilde{\theta}_{I,i}{\mathbf{1}}_{\{|\widetilde{\theta}_{I,i}|>\underline{c}^{(t)}\sigma\frac{\log(2p/\alpha)}{n}\}}\ .$$

$$Z_{f}:=\sum_{j=1}^{p}{\mathbf{1}}_{(\overline{\theta}_{I})_{j}=0}\,\varphi\Big(s;\frac{W_{j}}{\|Y^{(3)}\|_{2}}\Big)+{\mathbf{1}}_{(\overline{\theta}_{I})_{j}\neq 0}\ ,$$

$|\theta^{*}_{(k_{0}+q)}|$

$\displaystyle\sum_{i=1}^{p}\Big[\theta_{i}^{*2}$

$$\eta_{r,w}(x)=\frac{r}{(1-2\Phi(r))}\int_{-1}^{1}\frac{e^{-r^{2}\xi^{2}/2}}{2\pi}e^{\xi^{2}w^{2}/2}\cos(\xi wx)d\xi\ .$$

$$V(r,w)=\sum_{j=1}^{p}{\mathbf{1}}_{(\overline{\theta}_{I})_{j}=0}\big[1-\eta_{r,w}(W_{j}/\|\overline{Y}^{(3)}\|_{2})\big]+{\mathbf{1}}_{(\overline{\theta}_{I})_{j}\neq 0}\ .$$

$$r_{l}=2\log\Big(\frac{k_{0}}{l}\Big)\ ;\quad w_{l}=\log\Big(\frac{l}{p}\Big)\ .$$

$$V(r_{l},w_{l})\geq k_{0}+l+v^{i}_{\alpha,l}\ ,\quad\quad\text{ where }\quad v^{i}_{\alpha,l}=\frac{e^{1/2}}{2}\omega_{l}^{2}+\sqrt{2l\,n^{1/2}\log\Big(\frac{\pi^{2}[1+\log_{2}(l/l_{0})]^{2}}{6\alpha}\Big)}\ .$$

$$|\theta^{*}_{(k_{0}+q)}|\geq c_{\alpha}\sigma\frac{1+\log\big(\frac{k_{0}}{q\wedge k_{0}}\big)}{n\log\big(1+\frac{k_{0}}{p}\big)}\ ,\quad\text{ for some }q\geq c^{\prime}_{\alpha}k_{0}^{4/5}p^{1/10}\ .$$

$$\phi^{(ag)}=\max\Big(\phi^{(t)},\phi^{(\chi)},\phi^{(f)},\phi^{(i)},{\mathbf{1}}\{d_{2}^{2}(\widehat{\theta}_{SL},\mathbb{B}_{0}[k_{0}])\geq\sigma^{2}/2\}\Big)\ .$$

$$\rho^{2}_{k_{0},\Delta,\varsigma}=\left\{\begin{array}{ccc}\min\Big[\frac{\Delta}{n}\log(p),\frac{1}{\sqrt{n}}+\frac{k_{0}}{n}\log(p)\Big]&\text{ if }&0\leq k_{0}\leq p^{1/2-\varsigma}\ ;\\ \min\big[\frac{\Delta\log(p)}{n},\frac{k_{0}}{n\log(p)}\big]&\text{ if }&k_{0}>p^{1/2+\varsigma}\ .\end{array}\right.$$

$$\mathbb{B}_{0}[k_{0}+\Delta]\cap\big\{\theta^{*},\ d^{2}_{2}\big[\theta^{*};\mathbb{B}_{0}[k_{0}]\big]\geq c_{\delta,\varsigma}\rho^{2}_{k_{0},\Delta,\varsigma}\big\}$$

$$\rho_{\gamma}^{*2}[k_{0},\Delta]\asymp_{\gamma,\kappa,\varsigma}\left\{\begin{array}{cc}\frac{\Delta}{n}\log(p)&\text{ if }\Delta\leq\frac{\sqrt{n}}{\log(p)}+k_{0}\ ;\\ \frac{1}{\sqrt{n}}+\frac{k_{0}\log(p)}{n}&\text{ if }\Delta>\frac{\sqrt{n}}{\log(p)}+k_{0}\ .\end{array}\right.$$

Full text

Abstract

We consider the problem of sparsity testing in the high-dimensional linear regression model. The problem is to test whether the number of non-zero components (aka the sparsity) of the regression parameter $\theta^{*}$ is less than or equal to $k_{0}$ . We pinpoint the minimax separation distances for this problem, which amounts to quantifying how far a $k_{1}$ -sparse vector $\theta^{*}$ has to be from the set of $k_{0}$ -sparse vectors so that a test is able to reject the null hypothesis with high probability. Two scenarios are considered. In the independent scenario, the covariates are i.i.d. normally distributed and the noise level is known. In the general scenario, both the covariance matrix of the covariates and the noise level are unknown. Although the minimax separation distances differ in these two scenarios, both of them actually depend on $k_{0}$ and $k_{1}$ illustrating that for this composite-composite testing problem both the size of the null and of the alternative hypotheses play a key role.

1 Introduction

In the last decade, a lot of effort has been devoted to developing sound statistical methods for high-dimensional data. Most of the estimation procedures rely on the assumption that the parameter of interest has some possibly unknown structure. A prominent example is the high-dimensional linear regression problem where it is usually assumed that the regression parameter is sparse [7]. Despite the pervasiveness of the sparsity assumption in the literature, very few contributions challenge this assumption.

In this work, we tackle the largely ignored problem of assessing the sparsity of the regression parameter. Henceforth, we consider the random design high-dimensional linear regression model

$$Y=\mathbf{X}\theta^{*}+\sigma\epsilon\ ,\tag{1}$$

where the unknown parameter $\theta^{*}$ belongs to $\mathbb{R}^{p}$ , the noise vector $\epsilon\in\mathbb{R}^{n}$ follows a standard normal distribution and where the rows of $\mathbf{X}$ are i.i.d. sampled according to the normal distribution $\mathcal{N}(0,{\boldsymbol{\Sigma}})$ . For a given integer $k_{0}$ , we study the problem of testing whether the vector $\theta^{*}$ has at most $k_{0}$ non-zero components.
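To make the model concrete, here is a minimal simulation sketch using only the Python standard library. It assumes the independent design ${\boldsymbol{\Sigma}}=\mathbf{I}_{p}$; the function name, the seed argument, and the example sizes are illustrative choices rather than anything prescribed by the paper.

```python
import random

def simulate_independent_design(n, p, theta, sigma, seed=0):
    """Draw (Y, X) from Y = X theta + sigma * eps with i.i.d. N(0, 1) covariates.

    Assumption: Sigma is the identity, so every entry of X is standard normal.
    X is returned as a list of n rows, each of length p.
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    Y = [
        sum(x_ij * t_j for x_ij, t_j in zip(row, theta)) + sigma * rng.gauss(0.0, 1.0)
        for row in X
    ]
    return Y, X

if __name__ == "__main__":
    # A 2-sparse parameter in dimension p = 6, observed through n = 4 samples.
    theta_star = [1.5, 0.0, 0.0, -2.0, 0.0, 0.0]
    Y, X = simulate_independent_design(n=4, p=6, theta=theta_star, sigma=1.0)
    print(Y)
```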

1.1 Minimax separation distance

Before discussing our contribution, we first formalize the sparsity testing problem. For a vector $\theta$ , $\|\theta\|_{0}$ denotes its number of non-zero entries. Then, given a non-negative integer $k_{0}\in[0,p]$ , write $\mathbb{B}_{0}[k_{0}]=\{\theta\in\mathbb{R}^{p}:\|\theta\|_{0}\leq k_{0}\}$ for the set of $k_{0}$ -sparse vectors $\theta$ . Rephrasing our aim, we want to test whether $\theta^{*}$ belongs to $\mathbb{B}_{0}[k_{0}]$ .

In order to assess the quality of a testing procedure, we rely on the framework of minimax separation distances [29] which is described in the following paragraphs. Let $\|.\|_{2}$ denote the $l_{2}$ norm in $\mathbb{R}^{p}$ . For any $\theta^{*}\in\mathbb{R}^{p}$ , $d_{2}(\theta^{*},\mathbb{B}_{0}[k_{0}]):=\inf_{u\in\mathbb{B}_{0}[k_{0}]}\|\theta^{*}-u\|_{2}$ stands for its $l_{2}$ distance to the set of $k_{0}$ -sparse vectors. Intuitively, any $\alpha$ -level test $\phi$ of the null hypothesis $\{\theta^{*}\in\mathbb{B}_{0}[k_{0}]\}$ cannot reject the null with high probability when $d_{2}(\theta^{*},\mathbb{B}_{0}[k_{0}])$ is too small. In this work, we aim at characterizing the smallest distance $\rho$ , such that there exists a test achieving a small type I error probability and rejecting the null with high probability whenever $d_{2}(\theta^{*},\mathbb{B}_{0}[k_{0}])$ is larger than $\rho\sigma$ . These informal definitions are made precise in the next subsection. In the sequel, $\operatorname{\mathbb{P}}_{\theta^{*},{\boldsymbol{\Sigma}},\sigma}$ stands for the distribution of $(Y,\mathbf{X})$ in (1).
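The distance $d_{2}(\theta^{*},\mathbb{B}_{0}[k_{0}])$ has a simple closed form: the closest $k_{0}$ -sparse vector keeps the $k_{0}$ largest entries of $\theta^{*}$ in absolute value, so the distance is the $l_{2}$ norm of the remaining entries. A small sketch (the function name is an illustrative choice):

```python
import math

def dist_to_sparse(theta, k0):
    """l2 distance from theta to the set of k0-sparse vectors.

    The best k0-sparse approximation keeps the k0 largest entries in
    absolute value, so the distance is the norm of what is left over.
    """
    magnitudes = sorted((abs(t) for t in theta), reverse=True)
    return math.sqrt(sum(m * m for m in magnitudes[k0:]))

if __name__ == "__main__":
    theta = [3.0, 0.0, -4.0, 1.0]
    for k0 in range(5):
        print(k0, dist_to_sparse(theta, k0))
```

In particular the distance is zero exactly when $\theta^{*}$ has at most $k_{0}$ non-zero entries.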

In high-dimensional linear regression, the intrinsic difficulty of estimation or testing problems sometimes depends on some specific features such as the knowledge of the noise level $\sigma^{2}$ or the knowledge of the distribution of the design. In this work, we focus on two emblematic settings. In the independent setting, we assume that the covariates are independent ( ${\boldsymbol{\Sigma}}=\mathbf{I}_{p}$ ) and the noise level $\sigma$ is known. In the general setting, both the covariance of the covariates and the noise level are unknown.

1.1.1 Independent setting

Fix a positive integer $1\leq\Delta\leq p-k_{0}$ and consider the alternative hypothesis where $\theta^{*}$ is $(k_{0}+\Delta)$ -sparse. Given $\rho>0$ and a test $\phi$ , we introduce its risk $R(\phi;k_{0},\Delta,\rho)$ as the sum of the type I and type II error probabilities

$$R(\phi;k_{0},\Delta,\rho):=\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}]}\operatorname{\mathbb{P}}_{\theta^{*},\mathbf{I}_{p},\sigma}[\phi=1]+\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}+\Delta],\ d_{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\geq\rho\sigma}\operatorname{\mathbb{P}}_{\theta^{*},\mathbf{I}_{p},\sigma}[\phi=0]\ ,$$

where we only consider parameters $\theta^{*}$ in the alternative hypothesis that lie at a distance $d_{2}$ larger than $\rho\sigma$ from the null. For a fixed (known) $\gamma\in(0,1)$ , the separation distance $\rho_{\gamma}(\phi;k_{0},\Delta)$ of $\phi$ is the largest $\rho$ such that its risk is higher than $\gamma$ , i.e. $\rho_{\gamma}(\phi;k_{0},\Delta):=\sup\left\{\rho>0\ |R(\phi;k_{0},\Delta,\rho)>\gamma\right\}$ . Parameters $\theta^{*}$ lying at a distance larger than $\sigma\rho_{\gamma}(\phi;k_{0},\Delta)$ from the null are therefore detected with probability higher than $1-\gamma$ by $\phi$ . Finally, the minimax separation distance is

$$\rho_{\gamma}^{*}[k_{0},\Delta]:=\inf_{\phi}\rho_{\gamma}(\phi;k_{0},\Delta)\ ,$$

where the infimum is taken over all tests $\phi$ .

1.1.2 General setting

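In the general case, neither the covariance matrix ${\boldsymbol{\Sigma}}$ of the covariates, nor the noise level $\sigma$ is known. We only assume that the eigenvalues of ${\boldsymbol{\Sigma}}$ are bounded away from zero and from infinity. Respectively write $\eta_{\min}({\boldsymbol{\Sigma}})$ and $\eta_{\max}({\boldsymbol{\Sigma}})$ for its smallest and largest eigenvalues. Given $\eta>1$ , define

$$\mathcal{U}(\eta)=\{{\boldsymbol{\Sigma}}:\ \eta^{-1}\leq\eta_{\min}({\boldsymbol{\Sigma}})\leq\eta_{\max}({\boldsymbol{\Sigma}})\leq\eta\}\ .$$

Fix $\rho>0$ . In this general model, the risk of a test $\phi$ is now taken as

$$\mathbf{R}_{g}(\phi;k_{0},\Delta,\rho):=\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}],\ {\boldsymbol{\Sigma}}\in\mathcal{U}(\eta),\ \sigma>0}\operatorname{\mathbb{P}}_{\theta^{*},{\boldsymbol{\Sigma}},\sigma}[\phi=1]+\sup_{\sigma>0,\ \theta^{*}\in\mathbb{B}_{0}[k_{0}+\Delta],\ d_{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\geq\sigma\rho,\ {\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)}\operatorname{\mathbb{P}}_{\theta^{*},{\boldsymbol{\Sigma}},\sigma}[\phi=0]\ .$$

Since both ${\boldsymbol{\Sigma}}$ and $\sigma$ are unknown, we evaluate the type I and type II error probabilities uniformly over all $\sigma>0$ and all ${\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)$ . The class of covariance matrices is constrained to $\mathcal{U}(\eta)$ in order to preclude too difficult settings where the eigenvalues of ${\boldsymbol{\Sigma}}$ differ too much from each other. Then, as in the previous subsection, the separation distance of a test $\phi$ is $\boldsymbol{\rho}_{g,\gamma}(\phi;k_{0},\Delta):=\sup\left\{\rho>0\ |\mathbf{R}_{g}(\phi;k_{0},\Delta,\rho)>\gamma\right\}$ and the minimax separation distance in the general setting is defined by

$$\boldsymbol{\rho}_{g,\gamma}^{*}[k_{0},\Delta]:=\inf_{\phi}\boldsymbol{\rho}_{g,\gamma}(\phi;k_{0},\Delta)\ .$$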

In this work, we address both independent and general settings. More specifically,

(i) We characterize the minimax separation distances in both the independent ( $\rho^{*}_{\gamma}[k_{0},\Delta]$ ) and the general ( $\boldsymbol{\rho}^{*}_{g,\gamma}[k_{0},\Delta]$ ) settings by providing upper and lower bounds that match (up to a polylogarithmic loss in some regimes).

(ii) We introduce computationally feasible testing procedures that (almost) simultaneously achieve this minimax separation distance over all $\Delta$ .

1.2 Previous results and related literature

Before further describing our contribution, we first discuss related results in the literature.

Signal detection.

The signal detection problem which amounts to testing whether $\theta^{*}=0$ is a special instance of the sparsity testing problem (corresponding to $k_{0}=0$ ). Signal detection in the Gaussian vector model (which corresponds to an orthogonal design) has been extensively studied [29, 2, 23, 18, 19] in the last fifteen years. More recently, this problem has also been investigated in the random design linear regression model [28, 1, 16].

To simplify the discussion, let us consider the high-dimensional setting where $p\geq n^{1+\zeta}$ for some fixed constant $\zeta>0$ . Then, one can deduce from [28] that the minimax separation distance in the independent setting satisfies

$$\rho_{\gamma}^{*2}[0,\Delta]\asymp_{\gamma,\zeta}\min\left[\frac{\Delta\log(p)}{n},n^{-1/2}\right]\ ,$$

where $f(\Delta,n,p)\asymp_{\gamma,\zeta}g(\Delta,n,p)$ means that there exist positive constants $c_{\gamma,\zeta}$ and $c^{\prime}_{\gamma,\zeta}$ (possibly depending on $\gamma$ and $\zeta$ ) such that $f(\Delta,n,p)\leq c_{\gamma,\zeta}g(\Delta,n,p)\leq c^{\prime}_{\gamma,\zeta}f(\Delta,n,p)$ for all $\Delta$ , $n$ , and $p$ . For $\Delta\leq\sqrt{n}/\log(p)$ , this separation distance is achieved by measuring the raw correlations between the response and the covariates and rejecting when too many of these correlations are unusually large. This can be done through the Higher-Criticism scheme [28, 1]. For denser alternatives ( $\Delta\geq\sqrt{n}/\log(p)$ ), we start from the identity $\mathbb{E}Y_{i}^{2}=\sigma^{2}+\|\theta^{*}\|_{2}^{2}$ , where the $Y_{i}$ are i.i.d. Hence, a test rejecting when the empirical mean of $Y_{i}^{2}$ is significantly larger than $\sigma^{2}$ achieves the optimal squared separation distance of order $n^{-1/2}$ [28, 1]. In the specific regime where $p$ is of the same order as $n$ , and $\Delta$ is close to $\sqrt{n}$ , the analysis has to be refined, see [16].
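The dense-alternative test described above can be sketched in a few lines. This is a minimal illustration for the signal detection case with known $\sigma$, not the exact calibration of [28, 1]: the rejection threshold uses a Gaussian approximation of the null distribution (under $\theta^{*}=0$ , $Y_{i}^{2}/\sigma^{2}$ has mean 1 and variance 2), and the function name and the default level are assumptions.

```python
import math
from statistics import NormalDist

def energy_test(Y, sigma, alpha=0.05):
    """Reject (return True) when the empirical mean of Y_i^2 is large compared to sigma^2.

    Under the null theta* = 0, mean(Y_i^2) / sigma^2 has mean 1 and
    variance 2 / n, so we compare it to 1 + z_{1 - alpha} * sqrt(2 / n).
    The Gaussian approximation of the threshold is an assumption of this sketch.
    """
    n = len(Y)
    statistic = sum(y * y for y in Y) / (n * sigma ** 2)
    threshold = 1.0 + NormalDist().inv_cdf(1.0 - alpha) * math.sqrt(2.0 / n)
    return statistic > threshold

if __name__ == "__main__":
    import random
    rng = random.Random(0)
    pure_noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
    print(energy_test(pure_noise, sigma=1.0))
```

Since the threshold exceeds 1 by a term of order $n^{-1/2}$ , the test can only detect $\|\theta^{*}\|_{2}^{2}/\sigma^{2}$ of at least that order, which matches the $n^{-1/2}$ rate quoted above.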

In the general setting (unknown ${\boldsymbol{\Sigma}}$ and unknown $\sigma$ ), it has been proved in [45] that,

[TABLE]

However, for sparse alternatives, the corresponding test in [45] relies on an $l_{0}$ -type variable selection method and therefore has exponential computational complexity. For denser alternatives $(\Delta\geq\sqrt{p})$ , the lower bound entails that the minimax separation distance is large whenever $p\geq n^{2}$ . Comparing both the independent and the general settings, we observe that the separation distance is significantly larger in the general setting for dense alternatives $\Delta\geq\sqrt{n}/\log(p)$ .

Composite-composite testing problems and related work.

An important difference between the signal detection ( $k_{0}=0$ ) problem and the general sparsity testing problem ( $k_{0}>0$ ) is that, in the latter, the null hypothesis is composite, thereby making the analysis of the problem more challenging. To our knowledge, the analysis of such composite problems has been considered in only a few works [35, 3, 20, 15], although the problems of constructing adaptive confidence regions (e.g. [12, 13, 27, 39, 9, 8]) or of functional estimation (e.g. [38, 25, 14, 11, 10]) are also related to such testing problems.

In particular, Nickl and Van de Geer [39] consider the problem of constructing adaptive and honest confidence sets for $\theta^{*}$ in the linear regression model (1) with known variance $\sigma^{2}$ . To achieve adaptivity to the unknown sparsity of $\theta^{*}$ , Nickl and van de Geer need to test hypotheses of the form $\|\theta^{*}\|_{0}\leq k_{0}$ . Following the so-called “infimum testing” principle, described in a systematic way in [26], they consider the statistic $\inf_{\theta\in\mathbb{B}_{0}[k_{0}]}\|Y-\mathbf{X}\theta\|_{2}^{2}/n$ . This statistic corresponds to the infimum of the empirical variance when one corrects $Y$ by a $k_{0}$ -sparse vector $\theta$ . Under the null, this statistic is not much larger than the noise level $\sigma^{2}$ . This leads them to derive
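The infimum statistic can be computed by brute force when $p$ and $k_{0}$ are tiny, which helps to see what it measures. The sketch below enumerates all supports of size $k_{0}$ and projects $Y$ onto the orthogonal complement of the selected columns with a modified Gram-Schmidt step. It is an illustration only: the enumeration has exponential cost, and the helper names and the numerical tolerance are assumptions, not the procedure of [39].

```python
import itertools
import math

def residual_sq(Y, columns):
    """Squared norm of Y after removing its projection onto span(columns)."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:  # modified Gram-Schmidt against the current basis
            c = sum(a * b for a, b in zip(v, q))
            v = [a - c * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip (numerically) linearly dependent columns
            basis.append([a / norm for a in v])
    r = list(Y)
    for q in basis:
        c = sum(a * b for a, b in zip(r, q))
        r = [a - c * b for a, b in zip(r, q)]
    return sum(a * a for a in r)

def infimum_statistic(Y, X, k0):
    """inf over k0-sparse theta of ||Y - X theta||_2^2 / n, by exhaustive search."""
    n, p = len(Y), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    # Supports of size exactly min(k0, p) suffice: enlarging a support
    # can only decrease the least-squares residual.
    best = min(
        residual_sq(Y, [cols[j] for j in S])
        for S in itertools.combinations(range(p), min(k0, p))
    )
    return best / n

if __name__ == "__main__":
    X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
    Y = [2.0, 3.0, 0.0, 1.0]
    for k0 in range(3):
        print(k0, infimum_statistic(Y, X, k0))
```

The number of supports grows like $\binom{p}{k_{0}}$ , so this direct search is only practical for very small problems.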

$$\rho_{\gamma}^{*2}[k_{0},\Delta]\leq c_{\gamma}\Big[n^{-1/2}+\frac{k_{0}\log(p)}{n}\Big]\ ,$$

for some $c_{\gamma}>0$ . Comparing this bound with its counterpart in the signal detection problem ( $k_{0}=0$ ), we observe an increase by an additive term $\tfrac{k_{0}\log(p)}{n}$ accounting for the complexity of the null hypothesis.

To our knowledge, it is still unknown whether the upper bound of Nickl and van de Geer is optimal (that is, whether $\rho_{\gamma}^{*2}[k_{0},\Delta]$ actually depends on $k_{0}\log(p)/n$ ). In this manuscript, we answer this open question for all $k_{0}$ and $\Delta$ .

Sparsity testing in the Gaussian sequence model.

The Gaussian sequence model $Y=\theta^{*}+\sigma\epsilon$ corresponds to the case $p=n$ and $\mathbf{X}=\mathbf{I}_{p}$ . In [17], we have pinpointed the minimax separation distances for all $k_{0}$ and $\Delta$ , both when $\sigma$ is known and when $\sigma$ is unknown. In particular, the optimal separation distance actually depends on the size $k_{0}$ of the null hypothesis for large $k_{0}$ but is significantly smaller than what is obtained by infimum test strategies such as those in [26].

Generally speaking, [17] is closely related to the aims and results of this paper, but there is a significant challenge in adapting the results in [17] which are available for the Gaussian sequence setting, to the linear regression setting.

Related to this problem, some authors [11, 33, 34, 10] have considered the problem of estimating $\|\theta^{*}\|_{0}$ in the Gaussian sequence model in a Bayesian framework where all $\theta^{*}_{i}$ ’s are sampled according to some mixture distribution. Although some of the ideas can be borrowed from their work, this Bayesian setting is quite different (see [17] for a discussion).

1.3 Our results

In this paper, we characterize the minimax separation distances $\rho^{*}_{\gamma}[k_{0},\Delta]$ and $\boldsymbol{\rho}^{*}_{g,\gamma}[k_{0},\Delta]$ . To simplify the discussion, we restrict ourselves throughout this paper to the high-dimensional regime $p\geq n^{1+\zeta}$ where $\zeta>0$ is an arbitrarily small absolute constant.

Independent setting.

We establish matching (up to multiplicative constants depending on $\gamma$ ) upper and lower bounds for $\rho^{*}_{\gamma}[k_{0},\Delta]$ for almost all values of $k_{0}$ and $\Delta$ ; see Table 1 for a summary of these results. An aggregated test is also shown to simultaneously achieve the optimal separation distance for all $\Delta>0$ , entailing that adaptation to the sparsity is possible for this problem. In our exhaustive picture of $\rho^{*}_{\gamma}[k_{0},\Delta]$ , some of the regimes in $k_{0}$ and $\Delta$ are addressed by simple extensions of signal detection tests. However, other regimes turn out to be more challenging and require novel ideas. In what follows, we briefly mention these original aspects.

• We prove that, when $k_{0}\geq cn/\log(p)$ , the testing problem becomes extremely difficult, in the sense that the separation distance $\rho^{*}_{\gamma}[k_{0},\Delta]$ is very large. For $k_{0}\geq n$ , this separation distance is even infinite. This is not unexpected since identifiability problems arise in this regime.

• For moderate $k_{0}\in[\frac{\sqrt{n}}{\log(p)},p^{1/2-\zeta}]$ and large $\Delta$ , we prove that the upper bound of Nickl and Van de Geer [39] turns out to be optimal, i.e. the squared minimax separation distance is achieved by their infimum test and is of the order of $\frac{k_{0}\log(p)}{n}$ . The general idea is to reduce the problem of sparsity testing (with known variance) to a detection problem with unknown variance.

• For larger $k_{0}\in[p^{1/2-\zeta},cn/\log(p)]$ (where $\zeta>0$ is an arbitrarily small absolute constant, and where $c>0$ is an absolute constant), both the upper and the lower bounds are new. The lower bound is based on moment matching strategies and best polynomial approximation akin to those of [17] in the Gaussian model, but the derivation is significantly more involved in the regression setting. For small $\Delta$ ( $\Delta\leq k_{0}$ ), an optimal test is built using any estimator of $\theta^{*}$ achieving a small $l_{\infty}$ error (see e.g. [31, 51, 44, 32]). The test simply rejects when this estimator has more than $k_{0}$ unusually large entries. For denser alternatives ( $\Delta\geq k_{0}$ ), the approach is quite different. We build a statistic based on the empirical Fourier transform of some correction of the raw correlations between the covariates and the responses $Y$ . This approach is reminiscent of sparsity estimators in [33, 17] in the Gaussian sequence model.
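To give some intuition for the Fourier-type statistic mentioned in the last item, consider the kernel $\varphi(s;x)=\int_{-1}^{1}(1-|\xi|)\cos(\xi sx)e^{\xi^{2}s^{2}/2}d\xi$ listed among the paper's equations. The factor $e^{\xi^{2}s^{2}/2}$ exactly cancels the damping that standard Gaussian noise applies to a cosine, so that for $Z\sim\mathcal{N}(0,1)$ one gets $\mathbb{E}\,\varphi(s;x+Z)=2(1-\cos(sx))/(sx)^{2}$ , which equals 1 at $x=0$ and decays as $|x|$ grows. The sketch below checks this identity by simple midpoint quadrature; it works in a plain Gaussian sequence picture, which is an assumption of the illustration, and the function names and grid sizes are arbitrary choices.

```python
import math

def phi(s, x, grid=1000):
    """Midpoint-rule approximation of phi(s; x) over xi in [-1, 1]."""
    h = 2.0 / grid
    total = 0.0
    for i in range(grid):
        xi = -1.0 + (i + 0.5) * h
        total += (1.0 - abs(xi)) * math.cos(xi * s * x) * math.exp(xi * xi * s * s / 2.0)
    return total * h

def smoothed_phi(s, x, grid=200, width=8.0):
    """E[phi(s; x + Z)] for Z ~ N(0, 1), by midpoint quadrature over z."""
    h = 2.0 * width / grid
    total = 0.0
    for i in range(grid):
        z = -width + (i + 0.5) * h
        density = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
        total += phi(s, x + z) * density
    return total * h

def triangle_transform(a):
    """Closed form of the integral of (1 - |xi|) cos(a xi) over [-1, 1]."""
    return 1.0 if a == 0 else 2.0 * (1.0 - math.cos(a)) / (a * a)

if __name__ == "__main__":
    s = 1.0
    for x in (0.0, 1.0, 3.0):
        print(x, smoothed_phi(s, x), triangle_transform(s * x))
```

Summing such a kernel over coordinates therefore behaves, in expectation, like a smooth count of the coordinates with zero signal, which is the role it appears to play in the statistic $Z_{f}$ .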

General setting.

We derive lower and upper bounds of the minimax separation distance $\boldsymbol{\rho}^{*}_{g,\gamma}[k_{0},\Delta]$ . These bounds match except in the large $k_{0}$ and $\Delta$ regime, where there is a $\log^{2}(p)$ mismatch. See Table 2 for a summary of the results. As in the independent setting, we emphasize below the most novel ingredient of our analysis.

• Achieving the optimal squared distance $\Delta\log(p)/n$ could be easily done if one has access to an estimator whose $l_{\infty}$ distance to $\theta^{*}$ is less than $\sigma\sqrt{\log(p)/n}$ with high probability. However, such an estimator is unknown for general covariance matrices ${\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)$ . For $\|\theta^{*}\|_{0}\geq\sqrt{n}$ , it is even proved that no such estimator exists [9]. Here, we first select a reasonable candidate for the support of $\theta^{*}$ by relying on the non-convex penalized least-square estimator MCP [49]. Then, a test based on the restricted least-squares estimator applied to the selected subset is shown to achieve the desired separation distance. We also introduce an alternative test based on an iterative application of a projected version of the square-root Lasso.
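The first sentence of this item can be made concrete with a short sketch. Suppose a hypothetical estimator $\widehat{\theta}$ satisfies $\|\widehat{\theta}-\theta^{*}\|_{\infty}\leq\tau$ . Under the null, at most $k_{0}$ entries of $\widehat{\theta}$ can exceed $\tau$ in absolute value, whereas if the $(k_{0}+1)$ -th largest entry of $\theta^{*}$ exceeds $2\tau$ , at least $k_{0}+1$ entries of $\widehat{\theta}$ do. The function below implements this counting rule; the estimator, the function name, and the choice of $\tau$ are assumptions of the illustration.

```python
def count_test(theta_hat, k0, tau):
    """Reject (return True) when more than k0 entries of theta_hat exceed tau.

    Assumes theta_hat is some estimator with sup-norm error at most tau;
    how such an estimator is obtained is outside the scope of this sketch.
    """
    large_entries = sum(1 for t in theta_hat if abs(t) > tau)
    return large_entries > k0

if __name__ == "__main__":
    # theta* = (5, 0, 0, 3, 0) perturbed entrywise by at most tau = 1.
    theta_hat = [4.2, 0.9, -0.9, 3.8, 0.5]
    print(count_test(theta_hat, k0=2, tau=1.0))  # theta* is 2-sparse
    print(count_test(theta_hat, k0=1, tau=1.0))  # theta* is not 1-sparse
```

With $\tau$ of order $\sigma\sqrt{\log(p)/n}$ , an alternative with $\Delta$ extra entries of size $2\tau$ lies at squared distance of order $\sigma^{2}\Delta\log(p)/n$ from the null, which is where the rate $\Delta\log(p)/n$ comes from.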

1.4 Other related work

Two recent works [52, 30] have, among other things, considered general testing problems that encompass the sparsity testing problem. These two contributions assess the quality of their tests according to the $l_{\infty}$ separation distance (instead of $l_{2}$ as we do here) to the null hypothesis, i.e. $d_{\infty}(\theta^{*};\mathbb{B}_{0}[k_{0}])=\inf_{\theta\in\mathbb{B}_{0}[k_{0}]}\|\theta^{*}-\theta\|_{\infty}$ . In their setting, the covariance ${\boldsymbol{\Sigma}}$ of the covariates is unknown but its inverse ${\boldsymbol{\Sigma}}^{-1}$ is assumed to be sparse (each row of ${\boldsymbol{\Sigma}}^{-1}$ has at most $n/\log(p)$ non-zero entries) so that it can be reasonably well estimated. In that setting, the computationally feasible test in [52] has a small type II error probability when $k_{0}\log(p)$ is much smaller than $n^{1/4}$ and when $d_{\infty}(\theta^{*};\mathbb{B}_{0}[k_{0}])\geq c\sigma n^{-1/4}$ .

In [30], Javanmard and Lee use a test based on the debiased Lasso. It achieves a small type I error probability. Whenever $(k_{0}+\Delta)\log(p)$ is much smaller than $\sqrt{n}$ , and also $d_{\infty}(\theta^{*};\mathbb{B}_{0}[k_{0}])\geq c\sigma\sqrt{\log(p)/n}$ , its type II error probability is also small. Translating these results to the $l_{2}$ separation distance setting, we observe that this test achieves a squared separation distance $\Delta\log(p)/n$ which, in view of Table 2, is optimal for small $\Delta$ . Their approach could be used instead of ours in their setting. However, we stress that they achieve this bound at the price of considering a much more restricted class of covariance matrices than $\mathcal{U}(\eta)$ : they need each row of ${\boldsymbol{\Sigma}}^{-1}$ to be at most $n/\log(p)$ -sparse, while $\mathcal{U}(\eta)$ contains all matrices ${\boldsymbol{\Sigma}}$ that have their spectrum contained in $[\eta^{-1},\eta]$ .

A recent line of work has focused on testing the nullity of a given subset of coordinates of $\theta^{*}$ (e.g. [53, 54, 6, 32, 44, 51, 9]), but both the settings and the methodology are quite different.

1.5 Notation

For any positive integer $d$ and $u\in\mathbb{R}^{d}$ , we write $\mathcal{S}(u)=\{i:u_{i}\neq 0\}$ for the support of a vector $u$ . For $u\in\mathbb{R}^{d}$ and $S\subset[d]$ , we write $u_{S}=(u_{i}{\mathbf{1}}_{i\in S})_{i}$ for the vector in $\mathbb{R}^{d}$ whose values outside $S$ have been set to [math]. For a vector $\gamma$ , $\gamma_{(i)}$ stands for its $i$ -th largest (in absolute value) entry. Given $S\subset\{1,\ldots,p\}$ , $\overline{S}$ stands for its complement.

In the sequel, $c$ , $c_{1}$ , $c^{\prime}$ denote numerical positive constants that may vary from line to line. Given some quantity $\delta$ , $c_{\delta}$ stands for a positive constant possibly depending on $\delta$ that may vary from line to line. Underlined constants such as $\underline{c}$ , $\underline{c}^{(1)}$ do not vary in the paper.

Let $a,b\in\mathbb{R}$ be two functions that may depend on several quantities such as $n,p,\Delta,k_{0}$ and let $u\in\mathbb{R}$ . We write $a\lesssim_{u}b$ (resp. $a\approx_{u}b$ ) if there exists a constant $c_{u}>0$ that depends only on $u$ (resp. two constants $c_{u}^{+},c_{u}^{-}>0$ that depend only on $u$ ) such that $a\leq c_{u}b$ (resp. such that $c_{u}^{-}b\leq a\leq c_{u}^{+}b$ ).

For $x>0$ , $\lfloor x\rfloor$ (resp. $\lceil x\rceil$ ) stands for the largest (resp. smallest) integer which is less than (resp. greater than) or equal to $x$ . Also, $\log_{2}$ stands for the binary logarithm. Finally, $\overline{\Phi}$ stands for the tail distribution function of a standard normal distribution.

2 Independent setting

To simplify the notation, we denote by $\operatorname{\mathbb{P}}_{\theta^{*},\sigma}$ the distribution of the data when ${\boldsymbol{\Sigma}}$ is the identity matrix. Recall that we are especially interested in the high-dimensional setting. This is why we shall sometimes assume that $p\geq n$ or even $p\geq n^{1+\zeta}$ for some $\zeta>0$ arbitrarily small.

2.1 Minimax lower bound

As a starting point, we prove that, when the size $k_{0}$ of the null hypothesis is too large, consistent testing is impossible. Indeed, assume that $k_{0}\geq n$ . Then, for any $(Y,\mathbf{X})\in\mathbb{R}^{n}\times\mathbb{R}^{n\times p}$ such that $\mathrm{Rank}(\mathbf{X})\geq n$ , there exists $\theta\in\mathbb{B}_{0}[k_{0}]$ that perfectly fits this sample ( $Y=\mathbf{X}\theta$ ) and it is therefore impossible to decipher whether $\theta^{*}$ is $k_{0}$ -sparse or not. The following proposition formalizes this observation.

Proposition 1.

If $k_{0}\geq n$ , then, for any $\gamma<1/2$ , and $1\leq\Delta\leq p-k_{0}$ , we have $\rho^{*}_{\gamma}[k_{0},\Delta]=\infty$ .
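The obstruction behind Proposition 1 can be illustrated numerically. The following is a minimal sketch in Python with NumPy (our choice for illustration); the sizes, the random seed and the choice of the first $n$ columns as support are assumptions made for the example, not part of the paper. It draws a Gaussian design with $k_{0}\geq n$ and exhibits a $k_{0}$ -sparse vector that interpolates the sample exactly, even though the true parameter is dense.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k0 = 5, 12, 5                    # illustrative sizes with k0 >= n

X = rng.standard_normal((n, p))        # Gaussian design, of rank n almost surely
theta_star = rng.standard_normal(p)    # a dense parameter (not k0-sparse)
Y = X @ theta_star + rng.standard_normal(n)

# Assumption: the first n columns of X are linearly independent
# (true almost surely for a Gaussian design), so we can solve exactly.
S = np.arange(n)
theta = np.zeros(p)
theta[S] = np.linalg.solve(X[:, S], Y)

print("non-zero entries of theta:", np.count_nonzero(theta))
print("perfect fit:", np.allclose(X @ theta, Y))
```

Since the data are fitted exactly by an element of $\mathbb{B}_{0}[k_{0}]$ , no statistic computed from $(Y,\mathbf{X})$ alone can rule out the null hypothesis in this regime.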

In the sequel, we therefore restrict ourselves to the case where $k_{0}<n$ . The next theorem provides a lower bound for the minimax separation distance of the sparsity testing problem.

Theorem 1.

Assume that $p\geq 2n$ . There exist positive numerical constants $c_{1}$ – $c_{5}$ such that the following holds for all $\gamma\leq 0.06$ and for all $p\geq c_{1}$ . For $1\leq\Delta\leq p-k_{0}$ , one has

[TABLE]

Furthermore, if $p\geq c_{2}n^{2}$ and $k_{0}\geq c_{3}n/\log(\sqrt{p}/n)$ , then

[TABLE]

for all $1\leq\Delta\leq p-k_{0}$ .

In particular, (7) entails that the sparsity testing problem turns out to be extremely difficult in the regime $n/\log(p)\lesssim k_{0}\lesssim n$ (at least when $p\geq n^{2}$ ).

The different regimes in (6) will be discussed together with the upper bounds at the end of the section. Let us shortly comment on the proof of Theorem 1. The functional $\rho^{*}_{\gamma}[k_{0},\Delta]$ is (almost) nondecreasing with respect to $k_{0}$ . As a consequence, the lower bound $\frac{\Delta}{n}\log(1+\frac{\sqrt{p}}{\Delta})$ is a straightforward consequence of the analysis of the detection problem e.g. in [28].

The two lower bounds $\frac{1}{\sqrt{n}}+\frac{k_{0}}{n}\log\big{[}1+\frac{\sqrt{p}}{k_{0}}\big{]}$ and (7) are based on a reduction argument. The proof stems from the fact that it is impossible to distinguish between two sets of hypotheses if these two sets are almost indistinguishable from a third party hypothesis. Here, the third party hypothesis corresponds to $\theta^{*}=0$ and a tailored noise variance $\sigma^{\prime}>\sigma$ . Plugging minimax lower bounds for detection with unknown variance allows us to get the desired rate. See the proof for more details.

In fact, it is most challenging to prove the minimax lower bound in the regime $\Delta>k_{0}>\sqrt{p}$ as we cannot apply any reduction technique to a signal detection problem and we need to take into account that both the null and the alternative hypotheses are composite. As for the Gaussian sequence model [17], we use a general moment matching technique [38], but the non-orthogonal design matrix $\mathbf{X}$ makes the computations trickier.

2.2 Testing procedures

In this subsection, we fix $\alpha$ and $\delta\in(0,1)$ . We now introduce three testing procedures whose combination leads to matching the previous minimax lower bound.

Without loss of generality, we assume that $n$ is divisible by $3$ and we divide the sample $(Y,\mathbf{X})$ into three subsamples $(Y^{(1)},\mathbf{X}^{(1)})$ , $(Y^{(2)},\mathbf{X}^{(2)})$ and $(Y^{(3)},\mathbf{X}^{(3)})$ of equal size $m=n/3$ . For $i=1,2,3$ , we write $\operatorname{\mathbb{P}}_{\theta^{*},\sigma}^{(i)}$ for the probability according to the $i$ -th subsample. In fact, some of the tests introduced below only use the first two subsamples. Nevertheless, we use three subsamples throughout the paper to simplify the presentation.

To characterize the performances of the testing procedures, we shall control the type I error probability uniformly over the null hypothesis and control the type II error probability on some ’large’ parameter subset of the alternative. To simplify the statements of the results we shall refer to these two properties as (P1) and (P2) as defined below.

Property P1. A test $\phi$ satisfies (P1[ $\alpha$ ]) if its type I error probability is less than or equal to $\alpha$ , that is $\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}]}\operatorname{\mathbb{P}}_{\theta^{*},\sigma}[\phi=1]\leq\alpha$ .

Property P2. A test $\phi$ satisfies (P2[ $\beta$ ]) on a set $\Theta$ if its type II error probability is less than or equal to $\beta$ uniformly on $\Theta$ , that is $\inf_{\theta^{*}\in\Theta}\operatorname{\mathbb{P}}_{\theta^{*},\sigma}[\phi=1]\geq 1-\beta$ .

Following the discussion in the previous subsection, we restrict our attention to sparsities $k_{0}$ that are less than $n/\log(p)$ . This is formalized in the following condition ( $\mathbf{A}[\alpha]$ ) where $\underline{c}^{(\bf A)},\underline{c}^{(\bf A)^{\prime}}$ are numerical constants (respectively small enough for $\underline{c}^{(\bf A)}$ and large enough for $\underline{c}^{(\bf A)^{\prime}}$ ) whose values are constrained in Propositions 2–5.

( $\mathbf{A}[\alpha]$ )

$(k_{0}\vee 1)\log(\frac{p}{\alpha})+\log^{2}\big{(}\frac{p}{\alpha}\big{)}\leq\underline{c}^{(\bf A)}n$ and $p\geq\underline{c}^{(\mathbf{A})^{\prime}}\ .$

2.2.1 Test $\phi^{(t)}$ based on a $l_{\infty}$ estimation of $\theta^{*}$

The first test aims at detecting whether $\theta^{*}$ contains at least $k_{0}+1$ ’large’ entries. In order to do so, we need to build a reasonable $l_{\infty}$ estimator of $\theta^{*}$ . Note that estimators based on the debiased Lasso have already been proved to achieve such a property (see e.g. [32]) in some settings. For the sake of completeness and as a gentle introduction to more challenging settings, we introduce here a slightly different estimator.

As a first step, we rely on a square-root Lasso [4] estimator based on the first subsample. From the design matrix $\mathbf{X}^{(1)}$ , we build its column normalized modification $\mathbf{T}^{(1)}$ by

[TABLE]

Set $\lambda=2\sqrt{\overline{\Phi}^{-1}(\delta/(4p))}$ . The square-root Lasso estimator is then defined by

[TABLE]

In this section, we could replace the square-root Lasso estimator by a classical Lasso estimator since the noise level $\sigma$ is known. Also, the design is normalized for the purpose of simplifying some proof arguments, but the results remain valid (with slightly different constants) with the unnormalized design matrix $\mathbf{X}^{(1)}$ .

Then, given $\widehat{\theta}_{SL}$ , we use the second sample to improve the estimation of $\theta^{*}$ . The estimator $\widetilde{\theta}_{\mathbf{I}}$ is based on the empirical raw correlations between the covariates and the residuals.

[TABLE]

Since the design is independent, $\widetilde{\theta}_{\mathbf{I}}$ is an unbiased estimator of $\theta^{*}$ . It is not hard to show (see the proof of the next proposition) that, under weak assumptions, this estimator satisfies $\|\widetilde{\theta}_{\mathbf{I}}-\theta^{*}\|_{\infty}\lesssim c\sigma\sqrt{\log(p)/n}$ with high probability. This is why we define the test $\phi^{(t)}$ rejecting the null if $\big{|}(\widetilde{\theta}_{\mathbf{I}})_{(k_{0}+1)}\big{|}\geq\underline{c}^{(t)}\sigma\sqrt{\log(p/\alpha)/n}$ , where a suitable value for the constant $\underline{c}^{(t)}$ is defined in the proof of Proposition 2 below. This test is powerful when $\theta^{*}$ contains at least $k_{0}+1$ large entries. This is formalized in the following proposition.

Proposition 2.

There exist numerical constants $\underline{c}^{(t)}$ , $c$ and $c^{\prime}$ such that the following holds under Condition ( $\mathbf{A}[\alpha\wedge\beta\wedge\delta]$ ). The test $\phi^{(t)}$ satisfies (P1[ $\alpha+\delta$ ]) and (P2[ $\beta+\delta$ ]) on the collections

[TABLE]

with $1\leq\Delta\leq c^{\prime}n/\log(p/\delta)$ .

Again, we emphasize that similar performances are achieved by the debiased Lasso test of Javanmard and Lee [30].
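The decision rule of $\phi^{(t)}$ can be sketched as follows. This is only an illustration in Python: the function name `phi_t` and the argument `c_t` are hypothetical, the value `c_t=1.0` is a placeholder for the constant $\underline{c}^{(t)}$ (whose value is fixed in the proof and is not reproduced here), and the estimator $\widetilde{\theta}_{\mathbf{I}}$ is assumed to have been computed beforehand.

```python
import math

def phi_t(theta_tilde, k0, sigma, n, p, alpha, c_t=1.0):
    """Return 1 (reject) if the (k0+1)-th largest |entry| of theta_tilde
    is at least c_t * sigma * sqrt(log(p / alpha) / n), and 0 otherwise.

    c_t is a placeholder for the paper's constant; 1.0 is NOT its true value.
    Requires k0 < len(theta_tilde).
    """
    magnitudes = sorted((abs(x) for x in theta_tilde), reverse=True)
    threshold = c_t * sigma * math.sqrt(math.log(p / alpha) / n)
    # magnitudes[k0] is the (k0+1)-th largest entry (0-based indexing).
    return int(magnitudes[k0] >= threshold)

# Toy estimate with three clearly non-zero coordinates.
theta_tilde = [2.0, 0.0, -1.5, 0.01, 0.9]
print(phi_t(theta_tilde, k0=2, sigma=1.0, n=100, p=5, alpha=0.05))
print(phi_t(theta_tilde, k0=3, sigma=1.0, n=100, p=5, alpha=0.05))
```

With these toy values the threshold is about $0.21$ , so the rule rejects $k_{0}=2$ (the third largest magnitude is $0.9$ ) and accepts $k_{0}=3$ (the fourth largest magnitude is $0.01$ ).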

2.2.2 Test $\phi^{(\chi)}$ based on the $l_{2}$ norm of the residuals

The second test is also simple. We heavily rely on the knowledge of the noise level $\sigma$ . In the detection setting $(k_{0}=0)$ , [28, 1] consider a test rejecting the null when the squared norm $\|Y\|_{2}^{2}/(n\sigma^{2})$ is large compared to one. Indeed, in expectation, $\|Y\|_{2}^{2}/(n\sigma^{2})$ is equal to $\|\theta^{*}\|_{2}^{2}/\sigma^{2}+1$ . Here, we have to adapt this statistic as $\|\theta^{*}\|_{2}^{2}$ is unknown under the null.

First, we project the square-root Lasso estimator $\widehat{\theta}_{SL}$ onto the parameter set corresponding to the null hypothesis. More precisely, we define $\widetilde{\theta}_{SL,k_{0}}=\arg\min_{\theta\in\mathbb{B}_{0}[k_{0}]}\|\widehat{\theta}_{SL}-\theta\|_{2}^{2}$ . In other words, $\widetilde{\theta}_{SL,k_{0}}$ is obtained from $\widehat{\theta}_{SL}$ by thresholding its $(p-k_{0})$ smallest entries to zero. Then, given $\widetilde{\theta}_{SL,k_{0}}$ , we use the second sample to assess whether $\theta^{*}$ is significantly different from $\widetilde{\theta}_{SL,k_{0}}$ . Define the residual vector $\widehat{R}_{k_{0}}=Y^{(2)}-\mathbf{X}^{(2)}\widetilde{\theta}_{SL,k_{0}}$ and, for $R\in\mathbb{R}^{m}$ , the statistic $Z_{\chi}[R]=\frac{\|R\|_{2}^{2}}{m\sigma^{2}}-1$ .

Taking the threshold $v_{\alpha,\delta}^{(\chi)}=\sqrt{\frac{\log(1/\alpha)}{m}}+\frac{(k_{0}\vee 1)\log(p/\delta)}{m}$ , we consider the test $\phi^{(\chi)}$ rejecting the null hypothesis when $Z_{\chi}[\widehat{R}_{k_{0}}]>\underline{c}^{(\chi)}v_{\alpha,\delta}^{(\chi)}$ , where the numerical constant $\underline{c}^{(\chi)}$ is introduced in the proof of the following proposition.

Proposition 3.

There exist numerical constants $\underline{c}^{(\chi)}$ and $c$ such that the following holds under Condition ( $\mathbf{A}[\alpha\wedge\beta\wedge\delta]$ ). The test $\phi^{(\chi)}$ satisfies (P1[ $\alpha+\delta$ ]) and (P2[ $\beta$ ]) on the collection

[TABLE]
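The construction of $\phi^{(\chi)}$ (projection onto $\mathbb{B}_{0}[k_{0}]$ , residuals on the second subsample, statistic $Z_{\chi}$ and threshold $v_{\alpha,\delta}^{(\chi)}$ ) can be sketched as below. This is a minimal illustration in Python with NumPy under stated assumptions: the function names are ours, `c_chi=1.0` is a placeholder for $\underline{c}^{(\chi)}$ , and the initial estimate `theta_hat` is assumed to be given (the square-root Lasso step is not reproduced).

```python
import numpy as np

def project_k0_sparse(theta_hat, k0):
    """l2-projection onto k0-sparse vectors: keep the k0 largest |entries|."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    out = np.zeros_like(theta_hat)
    if k0 > 0:
        keep = np.argsort(-np.abs(theta_hat))[:k0]
        out[keep] = theta_hat[keep]
    return out

def z_chi(R, sigma):
    """Z_chi[R] = ||R||_2^2 / (m sigma^2) - 1, with m the length of R."""
    R = np.asarray(R, dtype=float)
    return float(R @ R) / (R.size * sigma**2) - 1.0

def v_chi(m, k0, p, alpha, delta):
    """Threshold sqrt(log(1/alpha)/m) + (k0 v 1) log(p/delta) / m."""
    return np.sqrt(np.log(1 / alpha) / m) + max(k0, 1) * np.log(p / delta) / m

def phi_chi(Y2, X2, theta_hat, k0, sigma, alpha, delta, c_chi=1.0):
    """Reject (1) when Z_chi of the residuals exceeds c_chi * v_chi.

    c_chi is a placeholder for the paper's constant, not its true value.
    """
    m, p = X2.shape
    residuals = Y2 - X2 @ project_k0_sparse(theta_hat, k0)
    return int(z_chi(residuals, sigma) > c_chi * v_chi(m, k0, p, alpha, delta))

print(project_k0_sparse([0.1, -3.0, 2.0, 0.5], k0=2))
print(z_chi([2.0, 0.0], sigma=1.0))
```

Under the null and for an accurate estimate, the residuals are close to pure noise, so $Z_{\chi}$ is close to zero; a signal component that cannot be captured by a $k_{0}$ -sparse vector inflates the residual norm.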

It turns out that a combination of $\phi^{(t)}$ and $\phi^{(\chi)}$ matches the minimax lower bound of Theorem 1 when $k_{0}\leq\sqrt{p}$ . For larger null hypotheses, we need to rely on more intricate tests that are discussed in the next subsections.

2.2.3 Test $\phi^{(f)}$ based on the empirical Fourier transform of the raw covariances

In the Gaussian sequence framework ( $p=n$ and $\mathbf{X}=\mathbf{I}_{p}$ ), [17] have recovered the optimal separation distance using a test based on the empirical Fourier transform of the data. In this section, we adapt this approach to the linear regression model.

Conditionally on $Y$ , it is shown in the proof of Proposition 4 below that the normalized raw covariances $\mathbf{X}^{T}Y/\|Y\|_{2}$ follow a normal distribution with mean $\theta^{*}\|Y\|_{2}/[\sigma^{2}+\|\theta^{*}\|_{2}^{2}]$ and variance $\mathbf{I}_{p}-\theta^{*}\theta^{*T}/[\sigma^{2}+\|\theta^{*}\|_{2}^{2}]$ . Since $\|Y\|_{2}^{2}$ is concentrated around $n[\sigma^{2}+\|\theta^{*}\|_{2}^{2}]$ and assuming that $\|\theta^{*}\|_{2}^{2}$ is small compared to $\sigma^{2}$ , this implies that the raw covariances are almost distributed as a normal distribution with mean $\sqrt{n}\theta^{*}/\sigma$ and covariance $\mathbf{I}_{p}$ . This observation leads us to adapt the Fourier tests of [17] to our setting through raw covariances.

The purpose of the empirical Fourier transform statistic considered in [17] (see also [33, 34] for previous work) is to approximate the discontinuous function $\sum_{i=1}^{p}{\mathbf{1}}_{\theta^{*}_{i}\neq 0}$ . First, introduce, for $s>0$ , the function

[TABLE]

For $Z\sim\mathcal{N}(a,1)$ , standard computations lead to $\operatorname{\mathbb{E}}[\varphi(s;Z)]=1-2\frac{1-\cos(sa)}{(sa)^{2}}=:g(sa)$ . In particular, the function $g$ takes values in $[0,1]$ with $g(0)=0$ and $\lim_{|a|\rightarrow\infty}g(sa)=1$ (see [17]). In some way, $g(sa)$ is a smooth approximation to ${\mathbf{1}}_{a\neq 0}$ . The larger $s$ is, the closer $g(sa)$ is to the indicator function. However, $\varphi(s;Z)$ exhibits a higher variance for large $s$ .
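To visualize how $g$ approximates the indicator ${\mathbf{1}}_{a\neq 0}$ , here is a small Python sketch. It takes $g(x)=1-2(1-\cos x)/x^{2}$ , extended by continuity at $0$ , which is the form consistent with the properties $g(0)=0$ and $g\to 1$ at infinity stated above; the Taylor cutoff `1e-2` is our own numerical choice and the helper name `g` simply mirrors the notation.

```python
import math

def g(x):
    """g(x) = 1 - 2 (1 - cos x) / x^2, with g(0) = 0 by continuity."""
    if abs(x) < 1e-2:
        # Taylor expansion x^2/12 - x^4/360, used to avoid cancellation near 0.
        return x * x / 12.0 - x**4 / 360.0
    return 1.0 - 2.0 * (1.0 - math.cos(x)) / (x * x)

# For a fixed mean a, a larger s pushes g(s * a) closer to 1.
a = 0.5
for s in (1.0, 4.0, 16.0, 64.0):
    print(s, g(s * a))
```

Since $0\leq 1-\cos x\leq x^{2}/2$ , the values stay in $[0,1]$ , and $1-g(x)\leq 4/x^{2}$ , which is the quadratic rate of convergence to one.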

The conditional distribution of $\mathbf{X}^{T}Y/\|Y\|_{2}$ is close to a normal distribution with mean $\sqrt{n}\theta^{*}/\sigma$ and variance-covariance matrix $\mathbf{I}_{p}$ , provided that $\|\theta^{*}\|_{2}^{2}$ is small compared to $\sigma^{2}$ . Hence, it would be tempting to use a statistic of the form $\sum_{i=1}^{p}\varphi(s;(\mathbf{X}^{T}Y)_{i}/\|Y\|_{2})$ , which in expectation would be close to $\sum_{i=1}^{p}g(s\theta^{*}_{i})$ , which in turn would approximate $\|\theta^{*}\|_{0}$ . Unfortunately, large coordinates $|\theta^{*}_{i}|$ may perturb the concentration of the statistic since the true conditional covariance of $\mathbf{X}^{T}Y/\|Y\|_{2}$ is $\mathbf{I}_{p}-\theta^{*}\theta^{*T}/[\sigma^{2}+\|\theta^{*}\|_{2}^{2}]$ . To address this technical issue, we first correct $\theta^{*}$ by removing its large coefficients.

As in Subsection 2.2.1, the first two samples are respectively dedicated to building the Lasso estimator $\widehat{\theta}_{SL}$ and the debiased estimator $\widetilde{\theta}_{\mathbf{I}}$ . If $|[\widetilde{\theta}_{\mathbf{I}}]_{(k_{0}+1)}|>\underline{c}^{(t)}\sigma\sqrt{\log(2p/\alpha)/n}$ , then the test $\phi^{(f)}$ introduced below rejects the null hypothesis, otherwise we define $\overline{\theta}_{\mathbf{I}}$ as

[TABLE]

In Subsection 2.2.1, we argued that, with high probability, $\|\widetilde{\theta}_{\mathbf{I}}-\theta^{*}\|_{\infty}\leq\underline{c}^{(t)}\sigma\sqrt{\log(2p/\alpha)/n}$ . As a consequence, $\|\overline{\theta}_{\mathbf{I}}-\theta^{*}\|_{\infty}\leq 2\underline{c}^{(t)}\sigma\sqrt{\frac{\log(2p/\alpha)}{n}}$ and the support of $\overline{\theta}_{\mathbf{I}}$ is included in that of $\theta^{*}$ .

Finally, we use the third subsample to compute the corrected raw covariances $W_{j}={\bf X}^{(3)T}_{j}\overline{Y}^{(3)}$ with $\overline{Y}^{(3)}=Y^{(3)}-\mathbf{X}^{(3)}\overline{\theta}_{\mathbf{I}}$ relative to the linear regression model with parameter $\theta^{*}-\overline{\theta}_{\mathbf{I}}$ . Then, following the above heuristic explanation, we consider the statistic

[TABLE]

with tuning parameter $s=\sqrt{\log(e\frac{k_{0}}{\sqrt{p}})}\lor 1$ . For $j$ in the support of $\overline{\theta}_{\mathbf{I}}$ , we are already confident that $\theta^{*}_{j}$ is non zero and we do not have to rely on $\varphi$ . Finally, the test $\phi^{(f)}$ rejects the null when $Z_{f}\geq k_{0}+v^{(f)}_{\alpha}$ with $v^{(f)}_{\alpha}=s^{2}/5+se^{s^{2}/2}\sqrt{2p\log(2/\alpha)}$ .
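The tuning parameter $s$ and the threshold $v^{(f)}_{\alpha}$ are explicit, so the final decision step of $\phi^{(f)}$ can be sketched in Python as follows. The function names are hypothetical, the statistic $Z_{f}$ is assumed to have been computed beforehand, and setting $s=1$ when the logarithm is negative is our reading of the maximum in the definition of $s$ .

```python
import math

def fourier_tuning(k0, p):
    """s = max( sqrt(log(e * k0 / sqrt(p))), 1 ).

    When the logarithm is negative (k0 < sqrt(p) / e) we return 1.
    """
    inner = math.log(math.e * k0 / math.sqrt(p))
    return max(math.sqrt(inner), 1.0) if inner > 0 else 1.0

def fourier_threshold(s, p, alpha):
    """v = s^2 / 5 + s * exp(s^2 / 2) * sqrt(2 p log(2 / alpha))."""
    return s**2 / 5 + s * math.exp(s**2 / 2) * math.sqrt(2 * p * math.log(2 / alpha))

def phi_f_decision(Z_f, k0, p, alpha):
    """Reject (1) when Z_f >= k0 + v, with v computed from s."""
    s = fourier_tuning(k0, p)
    return int(Z_f >= k0 + fourier_threshold(s, p, alpha))

p = 10**4
print(fourier_tuning(10, p))     # k0 below sqrt(p): s equals 1
print(fourier_tuning(2000, p))   # k0 above sqrt(p): s grows like sqrt(log)
```

The factor $e^{s^{2}/2}$ in the threshold reflects the variance inflation mentioned above: a larger $s$ sharpens the approximation of the indicator but requires a larger threshold.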

In comparison to the original statistic of [17] for the Gaussian sequence model, we use here a slightly smaller tuning parameter $s$ and the threshold $v^{(f)}_{\alpha}$ has an additional term $s^{2}/5$ .

Proposition 4.

There exist constants $c$ , $c_{\alpha}$ , $c^{\prime}_{\alpha}$ , and $c^{\prime\prime}_{\alpha}$ such that the following holds under Condition ( $\mathbf{A}[\alpha\wedge\delta]$ ). The test $\phi^{(f)}$ satisfies (P1 $[\alpha+\delta]$ ) and (P2[ $\alpha+\delta+e^{-n/27}$ ]) on the collection of parameters $\theta^{*}$ satisfying $\|\theta^{*}\|_{0}\leq cn/\log(p/\delta)$ , $d^{2}_{2}\big{[}\theta^{*};\mathbb{B}_{0}[k_{0}]\big{]}\leq\sigma^{2}$ and at least one of the two following conditions.

[TABLE]

The test $\phi^{(f)}$ rejects the null hypothesis when there are many small non-zero coefficients in $\theta^{*}$ . In particular, if $\theta^{*}$ contains $2k_{0}>2\sqrt{p}$ coefficients of order $\sigma(n\log(p))^{-1/2}$ , then the null hypothesis is rejected with high probability. Note that $\sigma(n\log(p))^{-1/2}$ is much smaller than the value needed to recover the position of these non-zero coefficients, which is of the order $\sigma\sqrt{\log(p)/n}$ . This behavior is reminiscent of the minimax lower bound in Theorem 1, where the squared separation distance is proven to be at least of the order $k_{0}/[n\log(p)]$ for $\Delta\geq k_{0}\geq\sqrt{p}$ .

When there are a few entries in $\theta^{*}$ that are neither large nor small (see below for details), it turns out that $\phi^{(f)}$ only matches the minimax lower bound up to some $\log\log(p)$ multiplicative factor. To address this issue we need to introduce an additional test $\phi^{(i)}$ .

2.2.4 Intermediary regime: Test $\phi^{(i)}$ based on the empirical Fourier transform of the raw covariance

In this subsection, we focus on entries $\theta^{*}_{i}$ that are neither large (with respect to $\sigma\sqrt{\log(p)/n}$ ) as in the analysis of $\phi^{(t)}$ nor small (with respect to $\sigma\sqrt{1/(n\log(p))}$ ) as in the analysis of $\phi^{(f)}$ . This setting turns out to be relevant for large $k_{0}$ only and we assume henceforth that $k_{0}\geq 2^{11}\sqrt{p}$ . As in the previous section, we adapt a test from [17] in the Gaussian sequence setting by applying the empirical Fourier transform to the raw covariances.

Given two tuning parameters $r$ and $w$ , define the function

[TABLE]

and the statistic

[TABLE]

In order to get a grasp of this statistic let us consider the expectation of $\eta_{r,w}(X)$ for $X\sim\mathcal{N}(x,1)$ . Simple computations (see [17]) lead to $\operatorname{\mathbb{E}}[1-\eta_{r,w}(X)]=1-\frac{1}{1-2\overline{\Phi}(r)}\int_{-r}^{r}\phi(\xi)\cos(\xi x\frac{w}{r})d\xi$ , which for large $r$ , is close to $1-\exp(-x^{2}\tfrac{w^{2}}{2r^{2}})$ . Thus, in contrast to the population function $g$ introduced in the previous subsection, which converges to $1$ at a quadratic rate, this function converges to one at an exponential rate, thereby better handling moderate values of $\theta^{*}_{i}$ . The downside of using this statistic is that $\operatorname{\mathbb{E}}[1-\eta_{r,w}(X)]$ does not lie in $[0,1]$ .
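The approximation claimed above can be checked numerically. The sketch below, in plain Python, evaluates the integral formula with a midpoint rule and compares it with $1-\exp(-x^{2}w^{2}/(2r^{2}))$ . It assumes that $\phi$ inside the integral denotes the standard normal density (the symbol is also used for tests in this paper); the function names and the number of quadrature steps are our own choices.

```python
import math

def std_normal_pdf(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def expected_one_minus_eta(x, r, w, steps=20000):
    """1 - (1 - 2 Phi_bar(r))^{-1} * int_{-r}^{r} pdf(xi) cos(xi x w / r) d xi,
    with the integral approximated by the midpoint rule."""
    tail = 0.5 * math.erfc(r / math.sqrt(2))   # Phi_bar(r), normal upper tail
    h = 2 * r / steps
    integral = 0.0
    for k in range(steps):
        xi = -r + (k + 0.5) * h
        integral += std_normal_pdf(xi) * math.cos(xi * x * w / r)
    integral *= h
    return 1 - integral / (1 - 2 * tail)

def gaussian_limit(x, r, w):
    """Large-r approximation 1 - exp(-x^2 w^2 / (2 r^2))."""
    return 1 - math.exp(-(x * w) ** 2 / (2 * r * r))

for x in (0.0, 1.0, 2.0, 4.0):
    print(x, expected_one_minus_eta(x, r=6.0, w=3.0), gaussian_limit(x, r=6.0, w=3.0))
```

The agreement comes from the identity $\int_{\mathbb{R}}\phi(\xi)\cos(\xi t)d\xi=e^{-t^{2}/2}$ for the standard normal density, applied with $t=xw/r$ ; truncating the integral to $[-r,r]$ only costs a Gaussian tail.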

The test $\phi^{(i)}$ is an aggregation of multiple tests based on the statistics $V(r,w)$ for different tuning parameters $r$ and $w$ . Define $l_{0}=\lceil k_{0}^{4/5}p^{1/10}\rceil$ and the dyadic collection $\mathcal{L}_{0}=\{l_{0},2l_{0},4l_{0},\ldots,l_{\max}\}$ where $l_{\max}=2^{\lfloor\log_{2}(k_{0}/l_{0})\rfloor}l_{0}/4\leq k_{0}/4$ . Note that $\mathcal{L}_{0}$ is not empty if $k_{0}\geq 2^{11}\sqrt{p}$ and $p$ is large enough. Given any $l\in\mathcal{L}_{0}$ , define

[TABLE]

Then, the test $\phi^{(i)}$ rejects the null hypothesis if, for some $l\in\mathcal{L}_{0}$ ,

[TABLE]

In comparison to the test in [17], the collection of tuning parameters $\mathcal{L}_{0}$ is slightly narrower and the threshold $v^{i}_{\alpha,l}$ has an additional corrective term of the order of $\omega_{l}^{2}$ .
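The dyadic collection $\mathcal{L}_{0}$ is fully explicit, and a short Python sketch makes its size tangible. The function name `dyadic_collection` and the numerical values of $k_{0}$ and $p$ are illustrative assumptions; floating-point powers are used for $k_{0}^{4/5}p^{1/10}$ , which is accurate enough for an illustration.

```python
import math

def dyadic_collection(k0, p):
    """Return (l0, l_max, grid) with l0 = ceil(k0^{4/5} p^{1/10}),
    l_max = 2^{floor(log2(k0 / l0))} * l0 / 4 and grid = {l0, 2 l0, ..., l_max}.
    The grid is empty when l_max < l0."""
    l0 = math.ceil(k0 ** 0.8 * p ** 0.1)
    j = math.floor(math.log2(k0 / l0))
    l_max = 2 ** j * l0 / 4
    grid = []
    l = l0
    while l <= l_max:
        grid.append(l)
        l *= 2
    return l0, l_max, grid

p = 10**12                                   # sqrt(p) = 10^6
for k0 in (10**6, 10**10, 5 * 10**10):
    l0, l_max, grid = dyadic_collection(k0, p)
    print(k0, len(grid), l_max <= k0 / 4)
```

Since $k_{0}/(k_{0}^{4/5}p^{1/10})=(k_{0}/\sqrt{p})^{1/5}$ , the collection contains only a handful of scales even for very large $k_{0}$ , so the aggregation over $l\in\mathcal{L}_{0}$ involves few tests.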

Proposition 5.

There exist positive constants $c,c_{\alpha},c_{\alpha}^{\prime}$ such that the following holds under Condition ( $\mathbf{A}[\alpha\wedge\delta]$ ). The test $\phi^{(i)}$ satisfies (P1 $[\alpha+\delta]$ ) and (P2[ $\alpha+\delta+e^{-n/27}$ ]) on the collection of parameters $\theta^{*}$ satisfying $\|\theta^{*}\|_{0}\leq cn/\log(p/\delta)$ , $d^{2}_{2}\big{[}\theta^{*};\mathbb{B}_{0}[k_{0}]\big{]}\leq\sigma^{2}$ and

[TABLE]

In comparison to Condition (14) for Proposition 4, $|\theta^{*}_{(k_{0}+q)}|$ is possibly much smaller than for $\phi^{(f)}$ in the regime where $k_{0}^{4/5}p^{1/10}\lesssim_{\alpha}q\lesssim_{\alpha}\sqrt{p}$ .

2.2.5 Aggregated test

To conclude this section, we evaluate the performances of the combination of all the previous tests. In fact, $\phi^{(i)}$ is only defined in the large $k_{0}$ regime. We take the convention that $\phi^{(i)}$ is a trivial test that always accepts the null hypothesis in the small $k_{0}$ regime. Consider the aggregated test

[TABLE]

The last test ${\mathbf{1}}\{d_{2}^{2}(\widehat{\theta}_{SL},\mathbb{B}_{0}[k_{0}])\geq\sigma^{2}/2\}$ is introduced for technical purposes to handle very dense alternatives ( $\|\theta^{*}\|_{0}\geq cn/\log(p/\delta)$ ).

Theorem 2.

Let $\delta\in(0,1)$ and $\varsigma\in(0,1)$ . There exist positive constants $c_{\varsigma}$ and $c_{\varsigma,\delta}$ such that the following holds. Assume that $p\geq c_{\varsigma}$ and that Condition ( $\mathbf{A}[\delta\wedge\delta]$ ) is satisfied. Define

[TABLE]

The test $\phi^{(ag)}$ satisfies (P1 $[\delta+4\alpha]$ ) and (P2[ $\delta+\alpha+e^{-n/27}$ ]) on the collection of parameters

[TABLE]

with $1\leq\Delta\leq p-k_{0}$ .

The case $k_{0}+\Delta\leq cn/\log(p/\delta)$ is a simple corollary of the previous results, whereas the dense case $k_{0}+\Delta>cn/\log(p/\delta)$ requires further work.

To further compare this result with the minimax lower bound of Theorem 1, we assume that $p\geq n^{1+\zeta}$ for some $\zeta>0$ . Recall that we also suppose $k_{0}\leq cn/\log(p)$ . From Theorems 1 and 2, we deduce that

Case 1: $k_{0}\leq p^{1/2-\kappa}$ with an arbitrary $\kappa\in(0,1/2)$ .

[TABLE]

Case 2: $k_{0}\geq p^{1/2+\kappa}$ with an arbitrary $\kappa\in(0,1/2)$ . For any $\varsigma\in(0,1/2)$ arbitrarily small, we have

[TABLE]

and that all these bounds are simultaneously achieved by the test $\phi^{(ag)}$ . As a consequence, $\phi^{(ag)}$ is simultaneously minimax over all $k_{0}$ and all $\Delta$ except in the regimes where $k_{0}$ is close to $\sqrt{p}$ or where $\Delta$ is close to $k_{0}$ , in which case there is possibly a polylogarithmic difference between the minimax lower and upper bounds.

Proof.

If $k_{0}\leq p^{1/2-\kappa}$ and $p\geq n^{1+\zeta}$ , then $\log(1+\sqrt{p}/k_{0})\asymp_{\kappa}\log(p)$ . Hence (6) in Theorem 1 ensures that the square minimax separation distance is at least of the order of $\min(\frac{\Delta}{n}\log(1+\frac{\sqrt{p}}{\Delta}),\frac{1}{\sqrt{n}}+\frac{k_{0}\log(p)}{n})$ . The first term is (up to numerical constants) larger than the second one when $\Delta\geq k_{0}\vee\sqrt{n}$ . For $\Delta\leq k_{0}\vee\sqrt{n}$ , we have $\log(\sqrt{p}/\Delta)\asymp_{\kappa,\zeta}\log(p)$ . Hence, the square minimax separation distance is at least of the order of $\min(\frac{\Delta}{n}\log(p),\frac{1}{\sqrt{n}}+\frac{k_{0}\log(p)}{n})$ which matches the upper bound of Theorem 2. If $k_{0}\geq p^{1/2+\kappa}$ , $p\geq n^{1+\zeta}$ , and $\Delta\leq k_{0}p^{-\varsigma}$ , then $\log(1+\sqrt{k_{0}/\Delta})\asymp_{\kappa,\zeta,\varsigma}\log(p)$ and Theorem 6 ensures that the square minimax separation distance is at least of the order of $\Delta\log(p)/n$ , matching again Theorem 2. When $\Delta\geq k_{0}$ , $\Delta\log^{2}(1+\sqrt{k_{0}/\Delta})\geq ck_{0}$ , and we deduce from Theorems 1 and 2 that the square minimax separation distance is of the order of $k_{0}/[n\log(p)]$ . ∎

Let us summarize the different regimes.

• If $\Delta$ is small (first result in Cases 1 and 2), then the squared minimax separation distance ( $\Delta\log(p)/n$ ) is the same as for signal detection ( $k_{0}=0$ ). The upper bound can be achieved using any $\sqrt{\log(p)/n}$ $l_{\infty}$ -consistent estimator of $\theta^{*}$ and simply counting the number of its large entries. In the independent setting such an estimator is easily built using the raw correlation ( $\widetilde{\theta}_{\mathbf{I}}$ ) between the variables and the response. Alternatively, one could use the debiased Lasso [31, 51, 44, 32] which is valid for a wider class of ${\boldsymbol{\Sigma}}$ .

• If $\Delta$ is large and $k_{0}$ is small (second result in Case 1), then the squared minimax separation distance can be understood as the sum of the quantity $n^{-1/2}$ arising in signal detection and the complexity $k_{0}\log(p)/n$ of the null hypothesis. The matching upper bound is achieved by computing the $l_{2}$ norm of the residuals when plugging in a suitable estimator of $\theta^{*}$ . The upper bound was already obtained in [39] (for a computationally inefficient method) but the matching minimax lower bound is new.

• Finally, if $\Delta$ is large and $k_{0}$ is large (second result in Case 2), then the minimax separation distance is highly non-standard and depends on the complexity of the null hypothesis. Both the lower and upper bounds are new. In some way, they both draw inspiration from the analysis [17] of the same problem in the Gaussian sequence framework.

In this paper, we focused on recovering the minimax separation distance in the high-dimensional setting, namely we require $p\geq n^{1+\zeta}$ , where $\zeta>0$ is an arbitrarily small universal constant. Aside from this restriction, there are two gaps in our analysis:

• First, when $p\geq c_{2}n^{2}$ and $k_{0}\in(n/\log(p),n)$ , our minimax lower bounds in Theorem 1 imply that the testing problem is almost impossible. However, for $p\leq c_{2}n^{2}$ , we did not manage to prove similar lower bounds. We conjecture that, for $p\leq c_{2}n^{2}$ and $k_{0}\in(n/\log(p),n)$ , $\rho^{*}_{\gamma}[k_{0},p]$ is huge, but we did not manage to prove it.

• Second, polylogarithmic mismatches between the upper and lower bounds arise when $k_{0}$ is close to $\sqrt{p}$ (e.g. $k_{0}\in[\sqrt{p}\log^{-\zeta}(p),\sqrt{p}\log^{\zeta}(p)]$ for some $\zeta>0$ ) and when $\Delta$ gets close to $k_{0}$ from below (i.e. $k_{0}p^{-\zeta}\leq\Delta\leq k_{0}$ for some arbitrarily small universal constant $\zeta>0$ ). In that regime, we could improve our upper bounds by adapting some higher-criticism [23] procedures, as was done in the sequence model [17]. However, even this new procedure would not completely close the gap. We conjecture that our minimax bound (6) is not completely sharp in that regime (see its proof for a tentative explanation).

3 General Setting

In this section, we focus on the general setting where ${\boldsymbol{\Sigma}}$ is unknown and is only assumed to belong to some class $\mathcal{U}(\eta)$ (4) for some $\eta>1$ . The noise variance $\sigma^{2}$ is also assumed to be unknown.

3.1 Minimax lower bound

Obviously, $\boldsymbol{\rho}^{*}_{g,\gamma}[k_{0},\Delta]$ is at least as large as $\rho^{*}_{\gamma}[k_{0},\Delta]$ since the covariance matrix ${\boldsymbol{\Sigma}}$ is unknown and $\mathbf{I}_{p}$ belongs to $\mathcal{U}[\eta]$ . Therefore, Theorem 1 in the previous section provides a lower bound on $\boldsymbol{\rho}^{*}_{g,\gamma}[k_{0},\Delta]$ . It turns out that this lower bound is sometimes loose and that the general setting is actually more challenging in some regimes, as shown by the following proposition.

Proposition 6.

Assume that $p\geq 2n$ . There exist positive numerical constants $c>0$ and $c^{\prime}>0$ such that for $p\geq c_{3}$ and for all $\gamma\leq 0.06$ , one has

[TABLE]

with $1\leq\Delta\leq p-k_{0}$ .

In fact, this result is a combination of Theorem 1 together with known minimax lower bounds for the detection problem ( $k_{0}=0$ ) with unknown variance [45, 46].

In comparison to the independent setting, one can no longer achieve the rate $1/\sqrt{n}+k_{0}\log(1+\sqrt{p}/k_{0})/n$ . Most importantly, the testing problem becomes almost impossible for dense alternatives ( $\Delta\gtrsim n/\log(p)$ ) in the high-dimensional regime $p\geq n^{2}$ .

3.2 Testing procedures

We can no longer rely on the test $\phi^{(\chi)}$ , since the noise level is unknown, nor on $\phi^{(t)}$ and $\phi^{(f)}$ , since their construction relies on the independence of the covariates.

As in the previous sections we introduce two properties (gP1) and (gP2) characterizing the type I and II error probabilities in this setting where the noise level $\sigma$ and the covariance matrix ${\boldsymbol{\Sigma}}$ are unknown.

Property gP1. A test $\phi$ satisfies (gP1[ $\alpha$ ]) if its type I error probability is less than or equal to $\alpha$ , that is $\sup_{\theta^{*}\in\mathbb{B}_{0}[k_{0}]}\sup_{\sigma>0}\sup_{{\boldsymbol{\Sigma}}\in\mathcal{U}[\eta]}\operatorname{\mathbb{P}}_{\theta^{*},\sigma,{\boldsymbol{\Sigma}}}[\phi=1]\leq\alpha$ .

Property gP2. A test $\phi$ satisfies (gP2[ $\beta$ ]) on the collection $\Theta$ of parameters if its type II error probability is uniformly less than or equal to $\beta$ , that is $\inf_{\sigma>0,{\boldsymbol{\Sigma}}\in\mathcal{U}[\eta]}\inf_{\theta^{*}\in\Theta}\operatorname{\mathbb{P}}_{\sigma\theta^{*},\sigma,{\boldsymbol{\Sigma}}}[\phi=1]\geq 1-\beta$ .

Note that in the above bound $\theta^{*}$ is rescaled by $\sigma$ for homogeneity purposes. As in Section 2, we restrict our attention to sparsities $k_{0}$ that are less than $n/\log(p)$ . The numerical constants $\underline{c}^{({\bf B})}$ and $\underline{c}^{({\bf B})^{\prime}}$ in the following condition are introduced in the proof of Proposition 7 and Corollary 1.

( $\mathbf{B}[\alpha]$ )

$(k_{0}\vee 1)\big{[}1+\log(p/\alpha)\big{]}+\log^{3}\big{(}\frac{1}{\alpha}\big{)}+\log(p)\log(\frac{1}{\alpha})\leq\underline{c}_{\eta}^{(\bf B)}n$ and $p\geq\underline{c}^{(\mathbf{B})^{\prime}}.$

In this section, we divide the sample in two subsamples $(Y^{(0)},\mathbf{X}^{(0)})$ and $(Y^{(1)},\mathbf{X}^{(1)})$ of equal size $m=n/2$ . As previously, we shall combine several tests to match the minimax lower bounds.

3.2.1 Test $\phi^{(u)}$ based on a $U$ -statistic.

The first test is specific to the moderate regime $p\leq n^{2}$ . For known $\sigma$ , we introduced in the previous section a statistic relying on the observation that $\|Y\|_{2}^{2}/n-\sigma^{2}$ estimates well $\|\theta^{*}\|_{2}^{2}$ . Then, relying on a good $k_{0}$ -sparse estimator $\widetilde{\theta}_{SL,k_{0}}$ of $\theta^{*}$ and computing the square norm of the residuals, we estimate $\|\theta^{*}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}$ , which under the null, should be small. Here, we follow the same strategy by considering an estimator of the signal strength, still valid for unknown $\sigma$ .

In [22], Dicker tackled the problem of estimating the signal strength $\|\theta^{*}\|_{2}^{2}$ in the setting where ${\boldsymbol{\Sigma}}=\mathbf{I}_{p}$ and $\sigma^{2}$ is unknown. This led him to introduce the $U$ -statistic $N=\frac{1}{n^{2}}[Y^{T}[\mathbf{X}\mathbf{X}^{T}-\frac{1}{n}\mathrm{tr}[\mathbf{X}\mathbf{X}^{T}]\mathbf{I}_{n}]Y]$ , which is unbiased and $\sqrt{p}/n$ consistent. For general ${\boldsymbol{\Sigma}}$ , this statistic has later been shown to be concentrated around the quadratic form $\theta^{*T}{\boldsymbol{\Sigma}}^{2}\theta^{*}$ (see [47, Sect.2.1]). As a consequence, one can rely on it to test the nullity of $\theta^{*}$ .
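For concreteness, the statistic $N$ can be computed as in the following sketch, written in Python with NumPy. It transcribes the displayed formula literally; the function name and the toy inputs are ours, and no claim is made here about constants or rates.

```python
import numpy as np

def dicker_u_statistic(Y, X):
    """N = (1 / n^2) * Y^T [ X X^T - (tr(X X^T) / n) I_n ] Y."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    gram = X @ X.T                                   # n x n Gram matrix
    centered = gram - (np.trace(gram) / n) * np.eye(n)
    return float(Y @ centered @ Y) / n**2

# Toy check on a 2 x 2 design.
X = np.array([[1.0, 0.0], [0.0, 0.0]])
Y = np.array([2.0, 0.0])
print(dicker_u_statistic(Y, X))
```

The centered matrix $\mathbf{X}\mathbf{X}^{T}-\frac{1}{n}\mathrm{tr}[\mathbf{X}\mathbf{X}^{T}]\mathbf{I}_{n}$ has zero trace. Hence, when $\theta^{*}=0$ , the conditional expectation of $N$ given $\mathbf{X}$ equals $\sigma^{2}$ times this trace divided by $n^{2}$ , that is zero, whatever the value of $\sigma$ . This is what makes the statistic usable when the noise level is unknown.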

For composite null hypotheses, we use $(Y^{(1)},\mathbf{X}^{(1)})$ to build $\widetilde{\theta}_{SL,k_{0}}$ as in Subsection 2.2.2 and then compute the residuals $\widehat{R}_{SL}$ with respect to the second sample, $\widehat{R}_{SL}=Y^{(0)}-\mathbf{X}^{(0)}\widetilde{\theta}_{SL,k_{0}}$ . Finally, we define the normalized $U$ -statistic $Z^{(u)}$ by

[TABLE]

Conditionally on $\widetilde{\theta}_{SL,k_{0}}$ , $\widehat{R}_{SL}$ is the response of a linear regression model with parameter $(\theta^{*}-\widetilde{\theta}_{SL,k_{0}})$ , variance $\sigma^{2}\mathbf{I}_{m}$ , and random design $\mathcal{N}(0,{\boldsymbol{\Sigma}})$ . Hence, the second moment of each entry of $\widehat{R}_{SL}$ equals $\sigma^{2}+(\theta^{*}-\widetilde{\theta}_{SL,k_{0}})^{T}{\boldsymbol{\Sigma}}(\theta^{*}-\widetilde{\theta}_{SL,k_{0}})$ and $\|\widehat{R}_{SL}\|_{2}^{2}/m$ is therefore close to $\sigma^{2}+(\theta^{*}-\widetilde{\theta}_{SL,k_{0}})^{T}{\boldsymbol{\Sigma}}(\theta^{*}-\widetilde{\theta}_{SL,k_{0}})$ . Intuitively, the statistic $Z^{(u)}$ is therefore expected to be close to

[TABLE]

so that a large value of $Z^{(u)}$ suggests that $\theta^{*}$ is significantly different from a $k_{0}$ -sparse vector. Setting the threshold

[TABLE]

we consider the test $\phi^{(u)}$ rejecting the null hypothesis when $Z^{(u)}>\underline{c}^{(u)}_{\eta}v^{(u)}_{\alpha}$ .

Proposition 7.

There exist three constants $\underline{c}^{(u)}_{\eta}$ , $c_{\eta}$ and $c^{\prime}_{\eta}$ such that the following holds under Condition (B( $\alpha\wedge\beta\wedge\delta$ )) and if $2n\leq p\leq c_{\eta}n^{2}\log^{-1}\big{(}\frac{2}{\alpha\wedge\beta}\big{)}$ . The test $\phi^{(u)}$ satisfies (gP1[ $\alpha+\delta$ ]) and (gP2[ $\beta$ ]) on the collection

[TABLE]

3.2.2 Recovering the $\Delta\log(p)/n$ rate with variable selection

To achieve the $\Delta\log(p)/n$ rate, it would suffice to estimate $\theta^{*}$ at the $l_{\infty}$ rate $\sigma\sqrt{\log(p)/n}$ as we did for the test $\phi^{(t)}$ in the previous section. However, we are unaware of any estimator achieving this rate uniformly over the class $\mathcal{U}(\eta)$ of covariance matrices ${\boldsymbol{\Sigma}}$ . For $k_{0}\geq\sqrt{n}$ , it is even proved that no such estimator exists [9].

Here, we adopt another strategy. We shall first estimate the support $\mathcal{S}(\theta^{*})$ of $\theta^{*}$ and count the number of large entries of the least-squares estimator of $\theta^{*}$ restricted to the estimated support $\widehat{S}$ . Of course, if $\widehat{S}=\mathcal{S}(\theta^{*})$ with high probability, then the restricted least-squares estimator $\widehat{\theta}_{\widehat{S}}$ (see below for a definition) will be close to $\theta^{*}$ in $l_{\infty}$ norm. Unfortunately, it is impossible for an estimator $\widehat{S}$ to estimate exactly the support $\mathcal{S}(\theta^{*})$ , especially when $\theta^{*}$ contains arbitrarily small coordinates.

This is why we shall require that the estimator $\widehat{S}$ satisfies a weaker property. Given $\mathbf{a}_{1}>0$ , let $M(\mathbf{a}_{1},\frac{\theta^{*}}{\sigma})=|\{i:0<\frac{|\theta^{*}_{i}|}{\sigma}\leq\mathbf{a}_{1}\sqrt{\log(p)/m}\}|$ be the number of small but nonzero coefficients of $\theta^{*}$ . Below, $\mathbf{a}_{1}$ , $\mathbf{a}_{2}$ , $\mathbf{a}_{3}$ refer to three positive quantities. Recall that $\overline{S}$ is the complement of $S$ .

Property ( $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ ). A (possibly random) set $S$ is said to satisfy this property if

[TABLE]

In other words, the cardinality of $S$ is not too large compared to the sparsity of $\theta^{*}$ and the square norm of $\theta^{*}$ outside $S$ is at most as large as that of the small entries of $\theta^{*}$ . Observe that the large entries of $\theta^{*}$ are not required to belong to $S$ .

Then, given a set $S$ , we consider the restricted least-squares estimator and the plug-in variance estimators

[TABLE]

For a vector $u\in\mathbb{R}^{p}$ and $c>0$ , $N[c;u]=|\{i:|u_{i}|\geq c\sqrt{\log(p)/m}\}|$ is the number of entries of $u$ that are larger than or equal to $c\sqrt{\log(p)/m}$ in absolute value. Then, we define the test $\phi^{(th)}[S;c]$ rejecting the null if and only if $N[c;\widehat{\theta}_{ls,S}/\widehat{\sigma}_{S}]\geq k_{0}+1$ , which means that $\widehat{\theta}_{ls,S}$ contains at least $k_{0}+1$ large entries.
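The counting rule is simple enough to be written out. The sketch below is only an illustration: it takes the restricted least-squares estimate and the variance estimate as given inputs (their definitions are in the display above), and the names `count_large_entries` and `threshold_test` are ours.

```python
import math


def count_large_entries(u, c, p, m):
    """N[c; u]: number of entries of u with |u_i| >= c * sqrt(log(p) / m)."""
    level = c * math.sqrt(math.log(p) / m)
    return sum(1 for x in u if abs(x) >= level)


def threshold_test(theta_ls, sigma_hat, c, k0, m):
    """Return True (reject the null) iff theta_ls / sigma_hat has at least k0 + 1 large entries.

    theta_ls  : list of p floats, the restricted least-squares estimate (assumed given)
    sigma_hat : positive float, the plug-in noise level estimate (assumed given)
    """
    p = len(theta_ls)
    scaled = [x / sigma_hat for x in theta_ls]
    return count_large_entries(scaled, c, p, m) >= k0 + 1


# Two entries exceed the level sqrt(log(4) / 100), so the test rejects
# k0 = 1 but not k0 = 2.
theta_ls = [3.0, -2.5, 0.01, 0.0]
print(threshold_test(theta_ls, 1.0, 1.0, 1, 100))  # True
print(threshold_test(theta_ls, 1.0, 1.0, 2, 100))  # False
```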

Theorem 3.

There exist constants $c$ and $c^{\prime}_{\eta}$ such that the following holds for any $p\geq 3$ . Consider any $\theta^{*}$ , $\sigma$ , ${\boldsymbol{\Sigma}}\in\mathcal{U}[\eta]$ , $\delta\in(0,1)$ , $(\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3})>0$ satisfying $c\big{[}\mathbf{a}_{2}\|\theta^{*}\|_{0}+\log\left(\frac{4}{\delta}\right)\big{]}\leq m$ , and $S$ satisfying $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ . Taking

[TABLE]

we have $\mathbb{P}^{(0)}_{\theta^{*},\sigma,{\boldsymbol{\Sigma}}}\big{[}\phi^{(th)}[S,\underline{c}_{*}]=1\big{]}\leq\delta$ if $\|\theta^{*}\|_{0}\leq k_{0}$ . Besides, $\mathbb{P}^{(0)}_{\theta^{*},\sigma,{\boldsymbol{\Sigma}}}\big{[}\phi^{(th)}[S,\underline{c}_{*}]=1\big{]}\geq 1-\delta$ if $\|\theta^{*}\|_{0}>k_{0}$ and

[TABLE]

If $S$ satisfies ( $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ ), then the test $\phi^{(th)}[S;c]$ with a suitable tuning parameter $c$ has a controlled type I error probability. Besides, its square separation distance over $\mathbb{B}_{0}[k_{0}+\Delta]$ is (up to constants depending on $\mathbf{a}_{1}$ and $\mathbf{a}_{3}$ ) of the order of $\Delta\log(p)/n$ .

In view of this general result, it suffices to build an estimator $\widehat{S}$ of the support based on $(Y^{(1)},\mathbf{X}^{(1)})$ that satisfies ( $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ ) for small $\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}$ to get the squared separation distance $\Delta\log(p)/n$ .

Unfortunately, the support of the Lasso estimator is only proved to satisfy the first part of property ( $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ ): its number of false positives is at most of the order of $\|\theta^{*}\|_{0}$ , see [50]. The second part of the property has only recently been proved to hold for non-convex penalized estimators such as the MCP estimator [49], see [24].

As in the previous section, we consider the column normalized design $\mathbf{T}^{(1)}$ . Given $\beta\in\mathbb{R}^{p}$ and two tuning parameters $b>0$ , $\lambda>0$ , the MCP criterion is defined by

[TABLE]

where $x_{+}=\max(x,0)$ . Local minimizers of the MCP criterion can be efficiently computed using the PLUS algorithm from [49] or the approximate regularization path approach of [48]. Non-convex penalized estimators suffer from less bias than Lasso estimators.
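The reduced bias can be read off the penalty itself. The sketch below implements the standard scalar MCP penalty of [49], $\rho(t;\lambda,b)=\lambda\int_{0}^{|t|}(1-x/(b\lambda))_{+}dx$ , in closed form. We assume this standard form here; the normalization used in the criterion above may differ, and the function names are ours.

```python
def mcp_penalty(t, lam, b):
    """Standard MCP penalty: lam * integral_0^{|t|} (1 - x / (b * lam))_+ dx.

    Closed form: lam*|t| - t^2 / (2b) for |t| <= b*lam, and b*lam^2 / 2 beyond.
    """
    a = abs(t)
    if a <= b * lam:
        return lam * a - a * a / (2.0 * b)
    return b * lam * lam / 2.0


def lasso_penalty(t, lam):
    """l1 penalty, shown for comparison."""
    return lam * abs(t)


# The MCP penalty is constant beyond b * lam, so large coefficients are not
# shrunk further, whereas the l1 penalty keeps growing linearly.
print(mcp_penalty(1.0, 1.0, 2.0))   # 0.75
print(mcp_penalty(5.0, 1.0, 2.0))   # 1.0
print(lasso_penalty(5.0, 1.0))      # 5.0
```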

Consider the square-root Lasso estimator (8) $\widehat{\theta}_{SL}$ with $\delta=1/p$ and the plug-in variance estimator $\widehat{\sigma}_{SL}=\|Y-\mathbf{X}\widehat{\theta}_{SL}\|_{2}/\sqrt{m}$ . Define the tuning parameters

[TABLE]

for some constants $\underline{c}^{(MCP)}_{\eta}$ and $\underline{c}^{{}^{\prime}(MCP)}_{\eta}$ whose range of possible values follows from [49] and [24]. The following proposition is a consequence of Corollary 1 in [24] together with Theorem 6 in [49].

Proposition 8.

There exist constants $\underline{c}^{(MCP)}_{\eta}$ , $c$ , $c^{(1)}_{\eta}$ – $c^{(4)}_{\eta}$ such that the following holds for any $\theta^{*}$ with $c^{(1)}_{\eta}\|\theta^{*}\|_{0}\leq m/\log(p)$ . With probability higher than $1-cp^{-1}$ , the support $\widehat{S}_{MCP}$ of any stationary point of the criterion (28) satisfies $\mathbf{S}[c^{(2)}_{\eta},c^{(3)}_{\eta},c^{(4)}_{\eta}]$ .

By [24], a similar result holds if we use the non-convex SCAD penalty instead of MCP.

Now, we can plug the support estimator $\widehat{S}_{MCP}$ into the test $\phi^{(th)}$ with a suitable constant $\underline{c}_{*}$ . The following result is a straightforward consequence of Theorem 3 and Proposition 8 and its proof is therefore omitted.

Corollary 1.

There exist constants $c$ , $c_{\eta,\delta}$ , and $c^{\prime}_{\eta,\delta}$ such that the following holds under Condition (B( $\delta$ )). The test $\phi^{(th)}[\widehat{S}_{MCP};\underline{c}^{(MCP),*}_{\eta}]$ satisfies (gP1[ $\frac{c}{p^{2}}+\delta$ ]) and (gP2[ $\frac{c}{p^{2}}+\delta$ ]) over the collections

[TABLE]

for all $1\leq\Delta\leq c^{\prime}_{\eta,\delta}n/\log(p)$ .

3.2.3 Aggregated tests and summary

Consider some $\delta>0$ . Since the performance of the test $\phi^{(u)}$ is only assessed in the regime $p\leq c_{\eta}n^{2}\log^{-1}\big{(}\frac{2}{\delta}\big{)}$ ( $c_{\eta}$ is introduced in Proposition 7), we combine the tests $\phi^{(u)}$ and $\phi^{(th)}[\widehat{S}_{MCP};\underline{c}^{(MCP),*}_{\eta}]$ only in that regime. For larger $p$ , we solely use $\phi^{(th)}[\widehat{S}_{MCP};\underline{c}^{(MCP),*}_{\eta}]$ . Combining Proposition 7 and Corollary 1 to evaluate the separation distance of the aggregated test and comparing it with the minimax lower bounds of Proposition 6, we obtain the following characterization. We assume here that we are in the high-dimensional regime, i.e. $p\geq n^{1+\zeta}$ where $\zeta>0$ is an arbitrarily small absolute constant.

Case 1: $p\leq n^{2-\kappa}$ with an arbitrary but fixed $\kappa\in(0,1/2)$ and $k_{0}\leq\sqrt{p}p^{-\varsigma}$ .

[TABLE]

for $1\leq\Delta\leq p-k_{0}$ .

Case 2: $p\leq n^{2-\kappa}$ with an arbitrary but fixed $\kappa\in(0,1/2)$ and $k_{0}\geq\sqrt{p}$ .

[TABLE]

for $1\leq\Delta\leq p-k_{0}$ .

Case 3: $p\geq n^{2}$ . For any $k_{0}$ and $\Delta$ smaller than $c_{\eta}n/\log(p)$ , we have

[TABLE]

whereas the problem becomes much more difficult for larger $\Delta$ or $k_{0}$ .

In conclusion, the aggregated test achieves the minimax separation distance except in the regime where $\sqrt{p}\leq k_{0}\leq\Delta\leq n$ , where there is a $\log^{2}(p)$ gap between the two squared rates.

Summing up our findings, we observe that

• For sparse alternatives (small $\Delta$ ; first result in Cases 1 and 2 and result in Case 3), the squared minimax separation distance is analogous to that of signal detection ( $k_{0}=0$ ), i.e. of order $\tfrac{\Delta\log(p)}{n}$ . It would be straightforward to achieve this distance if we had at our disposal a $\sqrt{\log(p)/n}$ $l_{\infty}$ -consistent estimator of $\theta^{*}$ . However, this is not possible over the class of ${\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)$ ( ${\boldsymbol{\Sigma}}$ unknown in this class) [9, 32]. This is why we use a slightly different approach that focuses on selecting most of the relevant features (as in Property $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ (25)).

• If $\Delta$ is large and $k_{0}$ is small (second result in Case 1), then the squared minimax separation distance is of the order of $\sqrt{p}/n$ and is the same as for signal detection ( $k_{0}=0$ ). It is achieved by a $U$ -statistic originally introduced for estimating $\|\theta^{*}\|_{2}^{2}$ when ${\boldsymbol{\Sigma}}=\mathbf{I}_{p}$ [22, 47].

• If $\Delta$ is large and $k_{0}$ is large (second result in Case 2), then the lower bound on the minimax separation distance reflects the complexity of the null hypothesis. The lower bound is the same as in the independent setting, described in the previous section. The upper bound is based on the same $U$ -statistic as in the previous case. Unfortunately, the upper and lower bounds only match up to a $\log^{2}(p)$ factor. In this general setting, we doubt that adapting the Fourier statistic of the previous section is possible and we conjecture that the squared separation distance is actually of the order $k_{0}\log(p)/n$ .

• Finally, we emphasize that, for $\Delta$ large compared to $n/\log(p)$ and $p\geq n^{2}$ , the optimal separation distance is huge (Proposition 6). Without further assumptions, it is therefore almost impossible to test whether $\theta^{*}$ is $k_{0}$ -sparse or if $\theta^{*}$ is a dense vector when $p\geq n^{2}$ . This result is in sharp contrast with the independent setting.

3.2.4 An alternative variable selection procedure

In the previous section, we established that the test $\phi^{(th)}$ applied to the support $\widehat{S}_{MCP}$ estimated by the MCP estimator achieves the square separation rate $\Delta\log(p)/n$ . Here, we introduce an alternative to the concave penalized estimator MCP based on simple iterations of the thresholded square-root Lasso.

Starting from $\widehat{S}_{0}=\emptyset$ , the algorithm builds a subset $\widehat{S}_{t}$ of variables iteratively from a subset $\widehat{S}_{t-1}$ of variables. It is done by applying a thresholded square-root Lasso to the data projected on the orthogonal of the variables in $\widehat{S}_{t-1}$ . Then, $\widehat{S}_{t}$ is the concatenation of $\widehat{S}_{t-1}$ and the variables selected by the thresholded square-root Lasso. The procedure stops after approximately $\log(n)$ iterations, and returns the current subset. The general idea is to iteratively remove non-zero coordinates of $\theta^{*}$ so that the projected square-root Lasso estimator is less perturbed by large coordinates of $\theta^{*}$ .

We need to introduce some notation. Define $T=\lfloor\log_{2}(n)\rfloor+1$ . Assume without loss of generality that $m/T=n/(2T)$ is an integer. We divide the sample $(Y^{(1)},\mathbf{X}^{(1)})$ into $T$ subsamples $\{(\underline{Y}^{(t)},\underline{\mathbf{X}}^{(t)})\}$ of size $m/T$ . Given an $r\times d$ matrix $\mathbf{M}$ and some subset $S\subset\{1,\ldots,p\}$ , we write $\mathbf{M}_{S}$ for the $r\times d$ matrix defined by $(\mathbf{M}_{S})_{i,j}=\mathbf{M}_{i,j}{\mathbf{1}}_{\{j\in S\}}$ . Given $S$ and any $1\leq t\leq T$ , define the subspace $V[S,\underline{\mathbf{X}}^{(t)}]=\mathrm{vect}(\underline{\mathbf{X}}_{S}^{(t)})$ of $\mathbb{R}^{m/T}$ spanned by the columns of $\underline{\mathbf{X}}_{S}^{(t)}$ , and an $(m/T-\mathrm{dim}(V[S,\underline{\mathbf{X}}^{(t)}]))\times m/T$ matrix $\underline{\boldsymbol{\Pi}}^{\perp}_{t,S}$ (measurable with respect to $\underline{\mathbf{X}}^{(t)}_{S}$ ) whose associated linear map vanishes on $V[S,\underline{\mathbf{X}}^{(t)}]$ and maps the orthogonal complement of $V[S,\underline{\mathbf{X}}^{(t)}]$ isometrically onto $\mathbb{R}^{m/T-\mathrm{dim}(V[S,\underline{\mathbf{X}}^{(t)}])}$ .

Next, we define the thresholded square-root Lasso estimator. Let $\underline{m}>0$ and let $\delta>0$ . Given an $\underline{m}\times p$ matrix $\underline{\mathbf{X}}$ and a size $\underline{m}$ vector $\underline{Y}$ , we write $\underline{\mathbf{X}}_{c}$ for the subdesign matrix of $\underline{\mathbf{X}}$ from which the null columns have been removed. Then, $\widehat{\theta}_{SL}(\underline{\mathbf{X}},\underline{Y})$ stands for the square-root Lasso estimator (see Equation (8)) of $(\underline{\mathbf{X}}_{c},\underline{Y})$ with parameter $\lambda=2\sqrt{\overline{\Phi}^{-1}(\delta/(2p))}$ . For notational convenience, we consider that $\widehat{\theta}_{SL}(\underline{\mathbf{X}},\underline{Y})\in\mathbb{R}^{p}$ and that its entries corresponding to null columns of $\underline{\mathbf{X}}$ are equal to zero. Using the plug-in variance estimator $\hat{\sigma}^{2}=\|\underline{Y}-\underline{\mathbf{X}}[\widehat{\theta}_{SL}(\underline{\mathbf{X}},\underline{Y})]\|_{2}^{2}/\underline{m}$ , we define the thresholding modification $\widehat{\theta}_{SL,t}(\underline{\mathbf{X}},\underline{Y})$ of $\widehat{\theta}_{SL}(\underline{\mathbf{X}},\underline{Y})$ such that

[TABLE]

where the constant $\underline{c}_{\eta}^{(SL)}$ is introduced in Lemma 1.

The set $\widehat{S}^{(ith)}$ is constructed as follows. We start with the empty support $\widehat{S}_{0}=\emptyset$ . At each step $t=1,\ldots,T$ , we apply $\underline{\boldsymbol{\Pi}}^{\perp}_{t,\widehat{S}_{t-1}}$ to both $\underline{\mathbf{X}}^{(t)}$ and $\underline{Y}^{(t)}$ , which removes their components in the space $V[\widehat{S}_{t-1},\underline{\mathbf{X}}^{(t)}]$ spanned by the variables in $\widehat{S}_{t-1}$ . Then, we apply the thresholded square-root Lasso to these projected data to select new variables, which are added to $\widehat{S}_{t-1}$ to form $\widehat{S}_{t}$ . Finally, $\widehat{S}^{(ith)}$ is the last set $\widehat{S}_{T}$ .
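The control flow of this construction can be sketched as follows. This is a simplified illustration under several assumptions of ours, not the procedure analyzed in Theorem 4. First, the selection step is a user-supplied callback `select` (a hypothetical interface), and the toy rule `top_correlation_selector` merely stands in for the thresholded square-root Lasso. Second, instead of the isometry $\underline{\boldsymbol{\Pi}}^{\perp}_{t,S}$ we residualize with respect to the selected columns, which yields vectors of length $m/T$ with the same inner products. Third, the subsamples are taken to be consecutive blocks of rows.

```python
import math


def num_rounds(n):
    """T = floor(log2(n)) + 1, which equals n.bit_length() for an integer n >= 1."""
    return n.bit_length()


def residualize(v, basis):
    """Subtract from v its components along an orthonormal family `basis`."""
    r = list(v)
    for q in basis:
        coef = sum(a * b for a, b in zip(r, q))
        r = [a - coef * b for a, b in zip(r, q)]
    return r


def orthonormal_basis(columns, tol=1e-10):
    """Gram-Schmidt orthonormalization, skipping (numerically) dependent columns."""
    basis = []
    for col in columns:
        r = residualize(col, basis)
        norm = math.sqrt(sum(a * a for a in r))
        if norm > tol:
            basis.append([a / norm for a in r])
    return basis


def iterative_selection(X, Y, T, select):
    """Grow a support over T rounds, using one disjoint block of rows per round.

    `select(columns, y)` is a user-supplied rule (hypothetical interface)
    returning the indices of the columns it selects.
    """
    p = len(X[0])
    size = len(X) // T  # rows per round; the text assumes this division is exact
    support = set()
    for t in range(T):
        block = range(t * size, (t + 1) * size)
        y_t = [Y[i] for i in block]
        cols = [[X[i][j] for i in block] for j in range(p)]
        basis = orthonormal_basis([cols[j] for j in sorted(support)])
        y_res = residualize(y_t, basis)
        # Already selected columns are replaced by zero columns.
        cols_res = [[0.0] * size if j in support else residualize(cols[j], basis)
                    for j in range(p)]
        support |= set(select(cols_res, y_res))
    return sorted(support)


def top_correlation_selector(threshold):
    """Toy stand-in for the thresholded square-root Lasso (NOT the paper's rule)."""
    def select(columns, y):
        scores = [abs(sum(a * b for a, b in zip(col, y))) for col in columns]
        best = max(range(len(columns)), key=lambda j: scores[j])
        return [best] if scores[best] >= threshold else []
    return select
```

With the toy selector, a first round picks the variable most correlated with the response, and the next round looks for the strongest remaining variable once the first one has been projected out, which mirrors the idea of removing large coordinates of $\theta^{*}$ before re-estimating.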

Theorem 4.

There exist constants $\underline{c}^{(ith)}_{\eta}$ and $c^{\prime}_{\eta}$ such that the following holds for any $\sigma>0$ , ${\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)$ and any $\theta^{*}$ satisfying

[TABLE]

With probability higher than $1-Tp^{-2}$ , the estimator $\widehat{S}^{(ith)}$ satisfies $\mathbf{S}[\underline{c}^{(ith)}_{\eta}\sqrt{T},2T,\underline{c}^{(ith)}_{\eta}\sqrt{T/2}]$ .

It turns out that $\widehat{S}^{(ith)}$ satisfies the desired property $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ but with $\mathbf{a}_{1}$ and $\mathbf{a}_{3}$ that are logarithmically large. As a consequence, the squared separation distance of the corresponding test $\phi^{(th)}$ with $\widehat{S}^{(ith)}$ is of order $\Delta\log(p)\log(n)/n$ , which is optimal up to an additional $\log(n)$ term in the regime $\Delta\leq k_{0}\wedge\sqrt{p}$ .

4 Discussion

In this section, we briefly discuss several related problems.

4.1 Low-dimensional problems

Although some of our results are valid in a low-dimensional setting, we focused our attention on pinpointing the minimax separation distance in a high-dimensional regime $p\geq n^{1+\zeta}$ which is arguably the most interesting one. Let us briefly discuss the low-dimensional regime $p\leq n/2$ . In the independent setting, the main difference is that the $n^{-1/2}$ rate can be improved to $\sqrt{p}/n+k_{0}\log(p)/n$ by considering the ordinary least-squares estimator and computing its $l_{2}$ -norm when its $k_{0}$ largest entries are removed. In the general setting, we can recover similar upper bounds as in Section 3, but with much simpler procedures based on the ordinary least-squares estimator.

Between these two regimes, the medium-dimensional case where $p$ and $n$ are of the same order is technically challenging. Our upper bounds and lower bounds only match up to a polylogarithmic factor. Deriving the sharp minimax separation distance requires further work.

4.2 Sparse inverse covariance matrices ${\boldsymbol{\Sigma}}^{-1}$ and debiased Lasso

Consider an intermediary setting where both $\sigma$ and ${\boldsymbol{\Sigma}}$ are unknown but ${\boldsymbol{\Sigma}}^{-1}$ is also restricted to have fewer than $\sqrt{n}/\log(p)$ non-zero entries in each row. In this setting, the minimax lower bounds of Proposition 6 in the general setting turn out to be still valid. Indeed, the proof of Proposition 6 holds in the simpler setting where ${\boldsymbol{\Sigma}}=\mathbf{I}_{p}$ and $\sigma$ is unknown. As in the general setting, the upper bound $k_{0}\log(p)/n+\sqrt{p}/n$ is achieved by the polynomial time $U$ -statistic of Section 3.2.1. In contrast, achieving the $\sqrt{\Delta\log(p)/n}$ separation distance in the small $\Delta$ regime is now much easier than in the general setting. Whereas we introduced a refitted least-squares estimator combined with the non-convex MCP regularized estimator, one can now alternatively rely on the debiased Lasso method [31, 51, 44, 32] to obtain a $\sqrt{\log(p)/n}$ $l_{\infty}$ -consistent estimator of $\theta^{*}$ and then simply count the number of its large entries. This was already done in [30] as discussed previously.

4.3 Known ${\boldsymbol{\Sigma}}$ and unknown $\sigma^{2}$ .

Consider the intermediate scenario where ${\boldsymbol{\Sigma}}=\mathbf{I}_{p}$ is the identity matrix, but $\sigma^{2}$ is unknown. As explained in the previous subsection, the minimax lower bounds of Proposition 6 stated for the general setting are still valid in this intermediate scenario. Obviously, we can also apply the testing procedures of Section 3.

However, in the general scenario, our lower and upper bounds only match up to a $\log^{2}(p)$ factor in the large $k_{0}$ , large $\Delta$ setting. More specifically, when $k_{0}\geq p^{1/2+\zeta}$ (for some $\zeta>0$ ) and $\Delta\geq k_{0}$ , the lower bound of Proposition 6 is of order $k_{0}/[n\log(p)]$ whereas Proposition 7 provides an upper bound of the order of $k_{0}\log(p)/n$ .

It turns out that, in this intermediate scenario, the gap is easily closed by adapting the Fourier-based tests $\phi^{(f)}$ and $\phi^{(i)}$ introduced in Section 2. Indeed, the only place where the knowledge of $\sigma$ is necessary in these two tests is in the definition of the pre-estimator $\overline{\theta}_{\mathbf{I}}$ which is a thresholded version of $\widetilde{\theta}_{\mathbf{I}}$ (9). If we replace $\sigma$ in this threshold by the plug-in estimator of the variance based on the square-root Lasso and if we increase some constants, the modified tests $\phi^{(f)}$ and $\phi^{(i)}$ no longer depend on $\sigma$ . Besides, one can easily check that (up to some changes in the numerical constants) Propositions 4 and 5 are still valid for these tests.

4.4 Unknown ${\boldsymbol{\Sigma}}$ and known $\sigma^{2}$ .

In this case, we can improve the upper bounds of the general case by adapting the test $\phi^{(\chi)}$ from Section 2. Indeed, the statistic $Z_{\chi}(\hat{R}_{k_{0}})$ is now centered on $\|{\boldsymbol{\Sigma}}^{1/2}(\theta^{*}-\tilde{\theta}_{SL,k_{0}})\|_{2}^{2}\geq\eta^{-1}\|\theta^{*}-\tilde{\theta}_{SL,k_{0}}\|_{2}^{2}$ on the class $\mathcal{U}(\eta)$ of ${\boldsymbol{\Sigma}}$ . Hence, the corresponding test achieves a squared separation distance of the order of $n^{-1/2}+k_{0}\log(p)/n$ . The main difference with the independent case is that we are not able to adapt the Fourier-based tests $\phi^{(f)}$ and $\phi^{(i)}$ to unknown ${\boldsymbol{\Sigma}}$ . In regimes where $p^{1/2+\kappa}\leq k_{0}\leq c_{\gamma}\frac{n}{\log(p)}$ and $\Delta\geq k_{0}$ , there is therefore a $\log^{2}(p)$ gap between our upper and lower bounds.

5 Proofs of the minimax upper bounds

5.1 Some results on the square-root Lasso and a simple debiased Lasso

We start with a few probability bounds for the square-root Lasso $\widehat{\theta}_{SL}$ and its thresholded modification $\widetilde{\theta}_{SL,k_{0}}$ , in which all entries of $\widehat{\theta}_{SL}$ except the $k_{0}$ largest ones are set to zero. They follow almost directly from earlier results [4, 36, 42, 40]. As we shall apply the lemma below in different contexts, we reintroduce the setting here. We consider an $m\times q$ linear regression model $Y=\mathbf{X}\theta^{*}+\sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,\mathbf{I}_{m})$ and where the rows of $\mathbf{X}$ are independent and follow a centered normal distribution with common covariance matrix ${\boldsymbol{\Sigma}}$ . The $m\times q$ matrix $\mathbf{T}$ is the column normalized version of $\mathbf{X}$ , i.e.

[TABLE]

We take

[TABLE]

and consider the square-root Lasso estimator [4, 42] ,

[TABLE]

Then, define $(\widehat{\theta}_{SL})$ as $(\widehat{\theta}_{SL})_{i}=(\widehat{\theta}_{SL,N})_{i}/\|\mathbf{X}_{.,i}\|_{2}$ for any $i=1,\ldots,q$ and $\widetilde{\theta}_{SL,k_{0}}=\operatorname*{arg\,min}_{\theta\in\mathbb{B}_{0}[k_{0}]}\|\theta-\widehat{\theta}_{SL}\|_{2}^{2}$ . The plug-in variance estimator is $\widehat{\sigma}_{SL}=\|Y-\mathbf{X}\widehat{\theta}_{SL}\|_{2}/\sqrt{m}$ .
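The minimization defining $\widetilde{\theta}_{SL,k_{0}}$ has an explicit solution: keep the $k_{0}$ largest entries (in absolute value) of $\widehat{\theta}_{SL}$ and set the others to zero. The short sketch below illustrates this, together with the squared distance $d_{2}^{2}(\theta;\mathbb{B}_{0}[k_{0}])$ to the set of $k_{0}$ -sparse vectors, which is the sum of the squares of the discarded entries. The function names are ours, and ties between entries of equal magnitude are broken by index order, which is an arbitrary choice.

```python
def keep_k_largest(theta, k0):
    """Closest k0-sparse vector to theta in l2 norm: keep the k0 largest |entries|."""
    order = sorted(range(len(theta)), key=lambda i: abs(theta[i]), reverse=True)
    keep = set(order[:k0])
    return [x if i in keep else 0.0 for i, x in enumerate(theta)]


def squared_distance_to_sparse(theta, k0):
    """d_2^2(theta; B_0[k0]): sum of squares of all but the k0 largest |entries|."""
    squares = sorted((x * x for x in theta), reverse=True)
    return sum(squares[k0:])


theta_hat = [0.5, -3.0, 2.0, 0.1]
print(keep_k_largest(theta_hat, 2))  # [0.0, -3.0, 2.0, 0.0]
```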

Lemma 1.

Fix any $0<\delta\leq 1/2$ and any $\eta\geq 1$ . There exist constants $\underline{c}^{(SL)}_{\eta}$ and $\underline{c}^{(SL),2}_{\eta}$ such that the following holds. Let $k_{\max}$ be the largest integer such that

[TABLE]

For any $\sigma>0$ , ${\boldsymbol{\Sigma}}\in\mathcal{U}[\eta]$ and $\theta^{*}$ with $\|\theta^{*}\|_{0}\leq k_{\max}$ , there exists an event $\mathcal{E}$ of probability higher than $1-\delta$ , such that

[TABLE]

Proof of Lemma 1.

We first argue that the design matrix $\mathbf{T}$ satisfies the compatibility property (see [36, 42]) with any set of size less than $k_{\max}$ and constant depending on $\eta$ . Indeed, Corollary 1 in [40] ensures that this property holds with probability higher than $1-qe^{-cm}\geq 1-\delta/2$ . Then, we are in a position to apply Theorem 1 in [42], which implies that $\widehat{\sigma}_{SL}/\sigma$ belongs to $[3/4,5/4]$ and that

[TABLE]

The second result of the lemma is a consequence of the first result. Denote by $S_{1}$ (resp. $S_{2}$ ) the subset of the $k_{0}$ largest entries of $\widehat{\theta}_{SL}$ (resp. $\theta^{*}$ ). From the definition of $\widetilde{\theta}_{SL,k_{0}}$ , we deduce that

[TABLE]

The result follows. ∎

5.2 Analysis of the tests $\phi^{(t)}$ , $\phi^{(\chi)}$ , and $\phi^{(u)}$

5.2.1 Proof of Proposition 2 (Test $\phi^{(t)}$ )

We start with an $l_{\infty}$ error bound on $\widetilde{\theta}_{\mathbf{I}}$ .

Lemma 2.

There exists a constant $c$ such that the following holds under ( $\mathbf{A}[\alpha\wedge\delta]$ ). For any $\theta^{*}\in\mathbb{R}^{p}$ with $\|\theta^{*}\|_{0}\leq k_{\max}$ (with $k_{\max}$ as in Lemma 1), we have

[TABLE]

with probability higher than $1-\delta-\alpha$ . Besides, for any $\theta^{*}\in\mathbb{R}^{p}$ , we have

[TABLE]

with probability higher than $1-\alpha$ .

From Lemma 2, we derive that with probability higher than $1-\delta-\alpha$ , we have $\|\theta^{*}-\widetilde{\theta}_{\mathbf{I}}\|_{\infty}\leq c\sigma\sqrt{\frac{\log(p/\alpha)}{n}}$ . Setting $\underline{c}^{(t)}$ as $c$ in Lemma 2, we derive that the test $\phi^{(t)}$ has a type I error probability less than or equal to $\alpha+\delta$ . Now consider a vector $\theta^{*}\in\mathbb{B}_{0}[k_{0}+\Delta]$ such that $|\theta^{*}_{(k_{0}+1)}|\geq 2.1\underline{c}^{(t)}\sigma\sqrt{\log[p/(\alpha\wedge\beta)]/n}$ . From Lemma 2, we deduce that, with probability higher than $1-\alpha-\beta$ ,

[TABLE]

and the test $\phi^{(t)}$ therefore rejects the null hypothesis, which concludes the proof.

Proof of Lemma 2.

Set $\widehat{\gamma}=\theta^{*}-\widehat{\theta}_{SL}$ . If $\|\theta^{*}\|_{0}\leq k_{\max}$ , the conditions of Lemma 1 are satisfied and it follows from this lemma that

[TABLE]

with $\operatorname{\mathbb{P}}^{(1)}_{\theta^{*},\sigma}$ probability higher than $1-\delta$ . In the second result of Lemma 2, we restrict ourselves to the case $\|\widehat{\gamma}\|_{2}^{2}\leq 2\sigma^{2}$ . Hence, it suffices to prove that, conditionally on $\widehat{\gamma}$ satisfying $\|\widehat{\gamma}\|_{2}^{2}\leq(2\vee c^{\prime})\sigma^{2}$ , we have $\|\theta^{*}-\widetilde{\theta}_{\mathbf{I}}\|_{\infty}\leq c_{1}\sigma\sqrt{\frac{\log(p/\alpha)}{n}}$ with probability higher than $1-\alpha$ .

Write $Z$ for the statistic defined by $Z=\theta^{*}-\widehat{\theta}_{SL}-\frac{1}{m}\mathbf{X}^{(2)T}(Y^{(2)}-\mathbf{X}^{(2)}\widehat{\theta}_{SL})$ . Also, define $\widehat{{\boldsymbol{\Sigma}}}=\frac{1}{m}\mathbf{X}^{(2)T}\mathbf{X}^{(2)}$ and $\widehat{{\boldsymbol{\Gamma}}}$ its diagonal part. We have

[TABLE]

We control each of these three quantities separately.

Lemma 3.

Let $\mathbf{Q}$ be a $d\times d$ symmetric matrix and let $G\sim\mathcal{N}(0,\mathbf{I}_{d})$ . Define $S=G^{T}\mathbf{Q}G$ . For any $t>0$ , one has

[TABLE]

with probability higher than $1-e^{-t}$ . Here, $\|\mathbf{Q}\|_{F}$ and $\|\mathbf{Q}\|_{op}$ respectively correspond to the Frobenius and operator norm of $\mathbf{Q}$ .

This result is a slight extension of Lemma 1 in [37] (that requires $\mathbf{Q}$ to be positive). The extension to general symmetric matrices proceeds from the same arguments and we omit the proof.

Let us first control $A_{3}$ . Each of the $p$ entries of $\sigma m^{-1}\mathbf{X}^{(2)T}\epsilon^{(2)}$ is distributed as a quadratic form of $2m$ standard normal random variables. The corresponding matrix $\mathbf{Q}$ satisfies $\operatorname{tr}(\mathbf{Q})=0$ and $\|\mathbf{Q}\|_{F}^{2}=\sigma^{2}/(2m)$ . Since $\|\mathbf{Q}\|_{op}\leq\|\mathbf{Q}\|_{F}$ , it follows from the above lemma together with a union bound that

[TABLE]

with $\operatorname{\mathbb{P}}^{(2)}_{\theta^{*},\sigma}$ probability higher than $1-\alpha/3$ . As for $A_{1}$ and $A_{2}$ , we first work conditionally on $\widehat{\gamma}$ . Fix $i\in\{1,\ldots,p\}$ . Then $\widehat{{\boldsymbol{\Gamma}}}_{ii}$ is distributed as a quadratic form of $m$ standard normal variables and the corresponding matrix $\mathbf{Q}$ satisfies $\operatorname{tr}(\mathbf{Q})=1$ , $\|\mathbf{Q}\|_{F}=m^{-1/2}$ and $\|\mathbf{Q}\|_{op}\leq 1/m$ . It then follows from Lemma 3 that, conditionally on $\widehat{\gamma}$ ,

[TABLE]

with $\operatorname{\mathbb{P}}^{(2)}_{\theta^{*},\sigma}$ probability higher than $1-\alpha/3$ . As for $A_{2}$ , observe that, conditionally on $\widehat{\gamma}$ , $[(\widehat{{\boldsymbol{\Sigma}}}-\widehat{{\boldsymbol{\Gamma}}})\widehat{\gamma}]_{j}=\frac{1}{m}\sum_{i}\mathbf{X}_{i,j}\sum_{j^{\prime}\neq j}\mathbf{X}_{i,j^{\prime}}\hat{\gamma}_{j^{\prime}}$ is distributed as $(\sum_{j^{\prime}\neq j}\widehat{\gamma}_{j^{\prime}}^{2})^{1/2}/m\sum_{q=1}^{m}U_{q}U^{\prime}_{q}$ where the $U_{q}$ ’s and $U^{\prime}_{q}$ ’s are independent standard normal random variables. Again, we deduce from Lemma 3 that, conditionally on $\widehat{\gamma}$ ,

[TABLE]

with $\operatorname{\mathbb{P}}^{(2)}_{\theta^{*},\sigma}$ probability higher than $1-\alpha/6$ . Finally, we gather (33) with (34–36) to conclude that there exists $c>0$ such that

[TABLE]

with probability larger than $1-\delta-\alpha$ .

∎

5.2.2 Proof of Proposition 3 (Test $\phi^{(\chi)}$ )

We first state the following lemma that characterizes the deviations of $Z_{\chi}[R]$ .

Lemma 4.

For any $t>0$ , any $\theta^{*}\in\mathbb{R}^{p}$ , any $\sigma>0$ , and any fixed $\theta$ , we have, for $\widehat{R}=Y^{(2)}-\mathbf{X}^{(2)}\theta$ ,

[TABLE]

with $\operatorname{\mathbb{P}}^{(2)}_{\theta^{*},\sigma}$ probability higher than $1-e^{-t}$ .

Proof of Lemma 4.

We have

[TABLE]

Hence, $\widehat{R}_{i}\sim\mathcal{N}(0,\sigma^{2}+\|\theta^{*}-\theta\|_{2}^{2})$ and these variables are independent from each other. So the random variable $\|\widehat{R}\|_{2}^{2}[\sigma^{2}+\|\theta^{*}-\theta\|_{2}^{2}]^{-1}$ follows a $\chi^{2}$ distribution with $n$ degrees of freedom. To prove the result, we only have to apply Lemma 3 with $\mathbf{Q}=\mathbf{I}_{p}$ . ∎

First assume that $\theta^{*}$ belongs to $\mathbb{B}_{0}[k_{0}]$ . With $\operatorname{\mathbb{P}}^{(2)}_{\theta^{*},\sigma}$ probability higher than $1-\alpha$ , we have

[TABLE]

where we used Condition ( $\mathbf{A}[\alpha\wedge\delta]$ ). Then, we apply Lemma 1 to control $\|\theta^{*}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}$ with probability higher than $1-\delta$ . With probability higher than $1-\alpha-\delta$ , we get

[TABLE]

so that choosing the constant $\underline{c}^{(\chi)}$ large enough leads to $({\bf P1}[\alpha+\delta])$ .

Now assume that $d_{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])>0$ . Since $\widetilde{\theta}_{SL,k_{0}}$ is $k_{0}$ -sparse, it follows that $\|\theta^{*}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}\geq d^{2}_{2}\big{[}\theta^{*};\mathbb{B}_{0}[k_{0}]\big{]}$ . Then, Lemma 4 ensures that, for $\log(1/\beta)$ small enough compared to $n$ (which is ensured by Condition $({\bf A}[\alpha\wedge\beta\wedge\delta])$ ), one has

[TABLE]

with $\operatorname{\mathbb{P}}^{(2)}_{\theta^{*},\sigma}$ probability larger than $1-\beta$ . As a consequence, under condition (11) with a constant $c$ large enough, the type II error probability is less than $\beta$ .

5.2.3 Proof of Proposition 7 (test $\phi^{(u)}$ )

The following lemma is borrowed from Theorem 2.1 in [47].

Lemma 5.

There exist numerical constants $c>0$ and $c^{\prime}>0$ such that the following holds. Assume that $p\geq m$ . Consider any $\theta^{*}\in\mathbb{R}^{p}$ , any $\sigma>0$ , and any ${\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)$ . Given any estimator $\widehat{\theta}$ based on the subsample $(Y^{(1)},\mathbf{X}^{(1)})$ , define $\widehat{R}=Y^{(0)}-\mathbf{X}^{(0)}\widehat{\theta}$ . We have, conditionally on $(\mathbf{X}^{(1)},Y^{(1)})$ ,

[TABLE]

for all $t\leq n^{1/3}$ .

First, assume that $\theta^{*}$ belongs to $\mathbb{B}_{0}[k_{0}]$ . Since Condition (B[ $\alpha,\beta$ ]) is satisfied with a constant $\underline{c}_{\mathbf{B}}^{\eta}$ large enough, we can apply (38). With probability higher than $1-\alpha$ , one has

[TABLE]

Then, we use Lemma 1 to conclude that

[TABLE]

with probability higher than $1-\alpha-\delta$ . Setting the constant $\underline{c}_{\eta}^{(u)}$ large enough, we conclude that the type I error probability of $\phi^{(u)}$ is less than $\alpha+\delta$ .

Now assume that $d_{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])>0$ . Since $\widetilde{\theta}_{SL,k_{0}}$ is $k_{0}$ -sparse, it follows that $\|\theta^{*}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}\geq d^{2}_{2}\big{[}\theta^{*};\mathbb{B}_{0}[k_{0}]\big{]}$ . Then, Lemma 5 ensures that, for $\log(1/\beta)$ small enough compared to $n$ , with probability higher than $1-\beta$ , one has

[TABLE]

where we used in the second and fourth line that ${\boldsymbol{\Sigma}}$ belongs to $\mathcal{U}(\eta)$ . Now assume that $d_{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])$ is large enough so that Condition (24) is satisfied. Choosing the constant $c^{\prime}_{\eta}$ in (24) large enough and the constant $c_{\eta}$ small enough, it then follows that the type II error probability is smaller than $\beta$ .

5.3 Analysis of $\phi^{(f)}$ and $\phi^{(i)}$ (Propositions 4 and 5)

In the proofs of this subsection, we set

[TABLE]

Since $\overline{\theta}_{\mathbf{I}}$ only depends on the first two subsamples, $\theta$ can be considered as fixed when we condition on these subsamples. To simplify the notation, we henceforth write $Y$ and $\mathbf{X}$ for $\overline{Y}^{(3)}$ and $\mathbf{X}^{(3)}$ , respectively, and work conditionally on $\theta$ . For any $1\leq i\leq m$ and $1\leq j\leq p$ , we have $\operatorname{Var}\left(Y_{i}\right)=\sigma^{2}+\|\theta\|_{2}^{2}$ and $\operatorname{Cov}\left(Y_{i},\mathbf{X}_{i,j}\right)=\theta_{j}$ . Hence, we have

[TABLE]

and since the $\mathbf{X}_{i,.}|(Y,\theta)$ are independent, we have

[TABLE]

For $W=\mathbf{X}^{T}Y$ , it holds that

[TABLE]

As a consequence, given $\|Y\|_{2}$ and $\theta$ , $\frac{W}{\|Y\|_{2}}$ behaves almost like a standard Gaussian vector. We shall prove that, under the condition of the propositions, the term $\frac{\theta\theta^{T}}{\sigma^{2}+\|\theta\|_{2}^{2}}$ in the covariance turns out to be negligible, whereas $\theta\|Y\|_{2}/[\sigma^{2}+\|\theta\|_{2}^{2}]$ is closely related to $\sqrt{n}\theta^{*}/\sigma$ . The following lemma states that the conditional expectations of $Z_{f}$ and $V(r_{l},\omega_{l})$ are almost the same as if the conditional covariance of $W/\|Y\|_{2}$ was the identity matrix. Recall the function $g$ introduced in Section 2.2.3. Define the function $\Psi_{l}(x)$ by $\Psi_{l}(x)=\operatorname{\mathbb{E}}[\eta_{r_{l},\omega_{l}}(X)]$ where $X\sim\mathcal{N}(x,1)$ . As explained in Section 2.2.4 and proved in [17] (Section C.2.3), $\Psi_{l}(x)=\frac{1}{1-2\overline{\Phi}(r_{l})}\int_{-r_{l}}^{r_{l}}\phi(\xi)\cos(\xi x\tfrac{\omega_{l}}{r_{l}})d\xi$ . Obviously, we have $\Psi_{l}(0)=1$ . Besides it is also shown in [17] (Section C.2.3) that $-\frac{l}{k_{0}}\leq\Psi_{l}(x)\leq\frac{l}{k_{0}}+2\exp\left(-\frac{\omega_{l}^{2}x^{2}}{r_{l}^{2}}\right)$ .
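The conditional law used above is an instance of Gaussian conditioning, and it can be checked numerically. The sketch below assumes the independent design $\mathbf{X}_{i,.}\sim\mathcal{N}(0,\mathbf{I}_{p})$ with $Y_{i}=\mathbf{X}_{i,.}\theta+\sigma\epsilon_{i}$ ; the function name and the numerical values are hypothetical and only serve as an illustration. It compares the closed-form conditional covariance $\mathbf{I}_{p}-\theta\theta^{T}/(\sigma^{2}+\|\theta\|_{2}^{2})$ with the generic Schur-complement formula and lists its eigenvalues.

```python
import numpy as np

def conditional_law_of_design_row(theta, sigma, y):
    """Mean and covariance of a design row X_i given Y_i = y.

    Assumption (illustration only): X_i ~ N(0, I_p) and
    Y_i = <X_i, theta> + sigma * eps_i, so that
    Var(Y_i) = sigma^2 + ||theta||^2 and Cov(Y_i, X_i) = theta.
    """
    theta = np.asarray(theta, dtype=float)
    var_y = sigma**2 + theta @ theta
    mean = theta * y / var_y
    cov = np.eye(theta.size) - np.outer(theta, theta) / var_y
    return mean, cov

# Hypothetical values chosen for the illustration.
theta = np.array([0.5, -0.3, 0.0, 0.2])
sigma = 1.0
mean, cov = conditional_law_of_design_row(theta, sigma, y=2.0)

# Cross-check with the Schur complement of the joint covariance of (Y_i, X_i).
p = theta.size
joint = np.block([
    [np.array([[sigma**2 + theta @ theta]]), theta[None, :]],
    [theta[:, None], np.eye(p)],
])
schur = joint[1:, 1:] - joint[1:, :1] @ joint[:1, 1:] / joint[0, 0]

# The covariance has one eigenvalue 1 - a and p - 1 eigenvalues equal to 1,
# where a = ||theta||^2 / (sigma^2 + ||theta||^2).
eigenvalues = np.sort(np.linalg.eigvalsh(cov))
a = theta @ theta / (sigma**2 + theta @ theta)
```

The single eigenvalue $1-a$ quantifies how far the conditional covariance is from the identity: the rank-one correction is small whenever $\|\theta\|_{2}$ is small compared to $\sigma$ .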

Lemma 6.

If $s\|\theta\|_{\infty}\leq\sigma$ , we have

[TABLE]

Consider any $l\in\mathcal{L}_{0}$ . If $\omega_{l}\|\theta\|_{\infty}\leq\sigma$ , we have

[TABLE]

Similarly, the next lemma shows that the deviations of the statistics $Z_{f}$ and $V(r_{l},\omega_{l})$ are almost the same as if the conditional covariance of $W/\|Y\|_{2}$ were the identity matrix.

Lemma 7.

Assume that $\|\theta\|_{\infty}\leq[\sigma^{2}+\|\theta\|_{2}^{2}]^{1/2}/s$ . For any $t>0$ , one has

[TABLE]

Besides, for any $l\in\mathcal{L}_{0}$ and any $t>0$ , one has

[TABLE]

Analysis of the tests under the null hypothesis. The assumptions of Lemma 2 are fulfilled. As a consequence, we have $\|\theta^{*}-\widetilde{\theta}_{\mathbf{I}}\|_{\infty}\leq\underline{c}^{(t)}\sigma\sqrt{\log(2p/\alpha)/n}$ with probability larger than $1-\delta-\alpha/2$ . Henceforth, we call this event $\mathcal{B}$ and work conditionally on it. Thus, the support of $\overline{\theta}_{\mathbf{I}}$ is included in that of $\theta^{*}$ , which in turn implies that $\sum_{j=1}^{p}{\mathbf{1}}_{\theta_{j}\neq 0}{\mathbf{1}}_{(\overline{\theta}_{\mathbf{I}})_{j}=0}+{\mathbf{1}}_{(\overline{\theta}_{\mathbf{I}})_{j}\neq 0}\leq k_{0}$ and hence $\|\theta\|_{0}\leq k_{0}$ . Besides, we also have $\|\theta\|_{\infty}\leq\underline{c}^{(t)}\sigma\sqrt{\frac{\log(2p/\alpha)}{n}}$ .

Since $\max_{l\in\mathcal{L}_{0}}\omega_{l}\leq s=\sqrt{\log(ek_{0}/\sqrt{p})}\lor 1\leq(\underline{c}^{(t)})^{-1}\sqrt{\frac{n}{\log(2p/\alpha)}}$ for $n\geq 9\underline{c}^{(t)2}\log^{2}(ep/\alpha)$ , it also follows from Assumption $\mathbf{A}[\alpha\wedge\delta]$ that

[TABLE]

Thus, we are in position to apply Lemma 6. As explained in Section 2.2.3, we have $g(0)=0$ and $g(x)\in[0,1]$ . It follows from Lemma 6 that $\operatorname{\mathbb{E}}^{(3)}_{\theta^{*},\sigma}\big{[}Z_{f}\big{|}(\|Y\|_{2},\theta)\big{]}\leq k_{0}+s^{2}/5$ . Also, since $1-\Psi_{l}(0)=0$ and $1-\Psi_{l}(x)\in[0,1+\frac{l}{k_{0}}]$ , we have

[TABLE]

Then, we apply the deviation inequalities of Lemma 7 and integrate them with respect to $\|Y\|_{2}$ to conclude that

[TABLE]

Taking the probability of the event $\mathcal{B}$ into account, we conclude that the type I error probability of both tests is bounded by $\alpha+\delta$ .

Analysis of the tests under the alternative hypothesis. Since $\|\theta^{*}\|_{0}$ is not too large, the assumptions of Lemma 2 are fulfilled. As under the null hypothesis, we have $\|\theta^{*}-\widetilde{\theta}_{\mathbf{I}}\|_{\infty}\leq\underline{c}^{(t)}\sigma\sqrt{\log(2p/\alpha)/n}$ with probability higher than $1-\delta-\alpha/2$ , and we still work conditionally on this event, called $\mathcal{B}$ . If $(\widetilde{\theta}_{\mathbf{I}})_{(k_{0}+1)}\geq\underline{c}^{(t)}\sigma\sqrt{\log(2p/\alpha)/n}$ , then both tests reject the null hypothesis, so that we can assume henceforth that $(\overline{\theta}_{\mathbf{I}})_{k_{0}+1}=0$ .

Since (44) is still valid, we are in position to apply again Lemmas 6 and 7. Hence, conditionally on $\theta$ and $\|Y\|_{2}$ , we have

[TABLE]

with probability higher than $1-\alpha/2$ . Define $\tilde{v}$ by

[TABLE]

Recall that $\lim_{x\rightarrow+\infty}g(x)=1$ and $\forall x\in\mathbb{R}$ , $0\leq g(x)\leq 1$ (see [17]). So it holds that

[TABLE]

Also, for any $l\in\mathcal{L}_{0}$ , we have

[TABLE]

with probability larger than $1-\alpha/2$ . As above, we have $\lim_{x\rightarrow\infty}\Psi_{l}(x)=0$ , and so

[TABLE]

In the sequel, we show that (45) and (46) imply the desired type II error probability bounds.

Case 1: Analysis of (45) for $\phi^{(f)}$ . Write $\overline{s}=\sqrt{\log(e\frac{k_{0}^{2}}{p})}\lor 1$ the tuning parameter used in [17] for the corresponding test in the Gaussian sequence model. Note that $s\geq\overline{s}/\sqrt{2}$ , $se^{s^{2}/2}\leq 2\overline{s}^{-1}e^{\overline{s}^{2}/2}$ , and $s^{2}/5\leq se^{s^{2}/2}$ . We have shown in the proof of Proposition 2 in [17] that for a vector $x\in\mathbb{R}^{p}$ and any $\alpha\in(0,1)$

[TABLE]

as soon as one of the two following conditions holds for constants $c_{\alpha},c^{\prime}_{\alpha},c^{\prime\prime}_{\alpha}$ positive and large enough, depending only on $\alpha$

[TABLE]

It then follows from (45), that, given $\theta$ and $\|Y\|_{2}^{2}$ satisfying $\mathcal{B}$ , the test rejects the null with probability higher than $1-\alpha/2$ if

[TABLE]

Recall that, for $i\notin\mathcal{S}(\overline{\theta}_{\mathbf{I}})$ , $\tilde{v}_{i}=v_{i}=\theta_{i}\|Y\|_{2}/(\sigma^{2}+\|\theta\|_{2}^{2})=\theta^{*}_{i}\|Y\|_{2}/(\sigma^{2}+\|\theta\|_{2}^{2})$ . Since $\|Y\|_{2}^{2}/(\sigma^{2}+\|\theta\|_{2}^{2})$ follows a $\chi^{2}$ distribution with $n/3$ degrees of freedom, we have $\|Y\|_{2}^{2}\geq n(\sigma^{2}+\|\theta\|_{2}^{2})/6$ with probability higher than $1-e^{-n/27}$ (see Lemma 3). This implies that for any $i=1,\ldots,p$ , we have

[TABLE]

Lemma 8.

Assume that the event $\mathcal{B}$ holds, that $d_{2}^{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\leq\sigma^{2}$ , and that $(\overline{\theta}_{\mathbf{I}})_{k_{0}+1}=0$ . We have $\|\theta\|_{2}^{2}\leq c^{\prime}\sigma^{2}$ .

As a consequence, on the intersection of $\mathcal{B}$ and an event of probability higher than $1-e^{-n/27}$ , we have

[TABLE]

Together with (47) and (48), we have characterized the type II error probability of $\phi^{(f)}$ .

Case 2: Analysis of (46) for $\phi^{(i)}$ . Observe that $\sqrt{e}\omega_{l}^{2}/2$ is at most of the order of $\log(p)$ and is therefore negligible compared to $\sqrt{p^{1/2}l}$ . We have shown in the proof of Proposition 3 in [17] that, for a vector $x\in\mathbb{R}^{p}$ and for any $\alpha$ in $(0,1)$ , we have

[TABLE]

for some $l\in\mathcal{L}_{0}$ , if for constants $c_{\alpha},c^{\prime}_{\alpha}$ positive and large enough, depending only on $\alpha$

[TABLE]

Actually, in Proposition 3 in [17], we had considered a wider range of $q$ ’s as the collection $\mathcal{L}_{0}$ was slightly larger, but this does not change the arguments here. In our setting, Condition (49) and (46) imply that $V[r_{l},\omega_{l}]\geq k_{0}+l+v_{\alpha,l}^{(i)}$ for some $l\in\mathcal{L}_{0}$ if

[TABLE]

Then, arguing as in Case 1, we have $|\tilde{v}_{i}|\geq c^{\prime}|\theta_{i}|/\sigma$ on the intersection of $\mathcal{B}$ and an event of probability higher than $1-e^{-n/27}$ . Putting everything together, we have controlled the type II error probability of $\phi^{(i)}$ .

Proof of Lemma 6.

In view of the conditional distribution of $W_{j}$ given $Y$ , one has

[TABLE]

Since $s\|\theta\|_{\infty}\leq\sigma$ , the remainder term is (in absolute value) less than

[TABLE]

Summing over all $j=1,\ldots,p$ such that $(\overline{\theta}_{\mathbf{I}})_{j}=0$ , we obtain the first result of Lemma 6. Turning to $V[r_{l},\omega_{l}]$ , we have

[TABLE]

As a consequence,

[TABLE]

where we used the condition $\omega_{l}\|\theta\|_{\infty}\leq\sigma$ in the second line. Summing this bound over all $j$ such that $(\overline{\theta}_{\mathbf{I}})_{j}=0$ yields the desired result.

∎

Proof of Lemma 7.

We shall apply the Gaussian concentration theorem (see e.g. [5]) to both $Z_{f}$ and $V[r_{l},\omega_{l}]$ . The covariance matrix ${\boldsymbol{\Gamma}}$ associated to the conditional distribution of $W/\|Y\|_{2}$ decomposes as $\mathbf{I}_{p}-a\frac{\theta}{\|\theta\|_{2}}\frac{\theta^{T}}{\|\theta\|_{2}}$ with $a=\|\theta\|_{2}^{2}/[\sigma^{2}+\|\theta\|_{2}^{2}]\in[0,1)$ ; in particular, its operator norm is at most one. Write ${\boldsymbol{\Gamma}}^{1/2}$ for a square root of this matrix and let $U$ denote a standard Gaussian vector. Conditionally on $Y$ , $W/\|Y\|_{2}$ is distributed as $v+{\boldsymbol{\Gamma}}^{1/2}U$ . For any $u\in\mathbb{R}^{p}$ , define

[TABLE]

Given two vectors $u$ and $u^{\prime}$ , one has

[TABLE]

since the cosine function is $1$ -Lipschitz. As a consequence, the function $u\mapsto Z(u)$ is $se^{s^{2}/2}\sqrt{p}$ -Lipschitz. The deviation inequalities (42) then follow from the Gaussian concentration theorem (see e.g. [5]).
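The role of ${\boldsymbol{\Gamma}}^{1/2}$ in this argument can be made concrete: composing a Lipschitz function with a linear map of operator norm at most one does not increase its Lipschitz constant. The following sketch, with a hypothetical helper name and hypothetical numerical values, builds the symmetric square root ${\boldsymbol{\Gamma}}^{1/2}=\mathbf{I}_{p}-(1-\sqrt{1-a})\frac{\theta\theta^{T}}{\|\theta\|_{2}^{2}}$ (one possible choice of square root, assuming $\theta\neq 0$ ) and checks both that it squares to ${\boldsymbol{\Gamma}}$ and that its operator norm does not exceed one.

```python
import numpy as np

def gamma_and_sqrt(theta, sigma):
    """Return Gamma = I - a u u^T and its symmetric square root.

    Here u = theta / ||theta|| and a = ||theta||^2 / (sigma^2 + ||theta||^2).
    Assumes theta != 0 (illustration only).
    """
    theta = np.asarray(theta, dtype=float)
    norm_sq = theta @ theta
    a = norm_sq / (sigma**2 + norm_sq)
    proj = np.outer(theta, theta) / norm_sq  # rank-one projector u u^T
    identity = np.eye(theta.size)
    gamma = identity - a * proj
    # (I - c P)^2 = I - (2c - c^2) P, and c = 1 - sqrt(1 - a) gives 2c - c^2 = a.
    gamma_sqrt = identity - (1.0 - np.sqrt(1.0 - a)) * proj
    return gamma, gamma_sqrt

# Hypothetical values: ||theta||^2 = 25 and sigma^2 = 25, hence a = 1/2.
gamma, gamma_sqrt = gamma_and_sqrt(np.array([3.0, 0.0, 4.0]), sigma=5.0)
op_norm = np.linalg.norm(gamma_sqrt, 2)  # largest singular value
```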

As for $V[r_{l},\omega_{l}]$ , we argue similarly that, for $\omega_{l}>r_{l}$ , it is conditionally distributed as a Lipschitz function of a standard Gaussian vector with Lipschitz constant

[TABLE]

Since $l\geq k_{0}^{4/5}p^{1/10}$ , we have $\omega_{l}^{2}-r_{l}^{2}\geq 2\omega_{l}$ for any $l\in\mathcal{L}_{0}$ and the Lipschitz constant is therefore less than

[TABLE]

where the last inequality is a consequence of the definition of $r_{l}$ and $\omega_{l}$ and is detailed in the proof of Lemma 6 in [17].

∎

Proof of Lemma 8.

Under $\mathcal{B}$ , we have $\|\theta^{*}-\widetilde{\theta}_{\mathbf{I}}\|_{\infty}\leq\underline{c}^{(t)}\sigma\sqrt{\log(2p/\alpha)/n}$ . Hence,

[TABLE]

where we used in the second line the definition of $\overline{\theta}_{\mathbf{I}}$ and $(\overline{\theta}_{\mathbf{I}})_{k_{0}+1}=0$ and we used $d_{2}^{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\leq\sigma^{2}$ together with Assumption $(\mathbf{A}[\alpha\wedge\delta])$ in the last line.

∎

5.4 Proof of Theorem 2

Consider any $\theta\in\mathbb{B}_{0}[k_{0}]$ . In view of Propositions 3–5, we can bound the rejection probability as follows

[TABLE]

Since, under the null hypothesis, $\theta^{*}$ is $k_{0}$ -sparse, we have

[TABLE]

Applying Lemma 1, we derive that, with probability higher than $1-\delta$ , $d_{2}^{2}(\widehat{\theta}_{SL};\mathbb{B}_{0}[k_{0}])\leq\underline{c}_{1}^{SL}\sigma^{2}\frac{k_{0}}{n}\log(p/\delta)$ . Thus, by Condition $\mathbf{A}[\alpha\wedge\delta]$ , we have $\mathbb{P}_{\theta,\sigma}[d_{2}^{2}(\widehat{\theta}_{SL};\mathbb{B}_{0}[k_{0}])\geq\sigma^{2}/2]\leq\delta$ . From (51), we derive that $\mathbb{P}_{\theta,\sigma}[\phi^{(ag)}=1]\leq 5\delta+4\alpha$ . Looking more closely at the proof of Propositions 3–5, we observe that each occurrence of the probability $\delta$ corresponds to the same control of the square-root Lasso estimator $\widehat{\theta}_{SL}$ . As a consequence, $\phi^{(ag)}$ satisfies ( ${\bf P}_{1}[\delta+4\alpha]$ ). Turning to the type II error, we fix $\Delta\leq p-k_{0}$ and assume that $\theta^{*}\in\mathbb{B}_{0}[k_{0}+\Delta]$ .

Case 1: $\Delta\leq cn/\log(p/\delta)$ . If $k_{0}\leq p^{1/2-\varsigma}$ , then the squared separation distance $\min[\Delta\log(p)/n,1/\sqrt{n}+k_{0}\log(p)/n]$ in (20) is a consequence of Propositions 2 and 3 and is achieved by the combination of $\phi^{(t)}$ and $\phi^{(\chi)}$ . If $k_{0}\geq p^{1/2+\varsigma}$ , the squared separation distance $\Delta\log(p)/n$ is still achieved by $\phi^{(t)}$ . To prove the last part of the result, let us assume that $\theta^{*}$ is such that $\max(\phi^{(t)},\phi^{(\chi)},\phi^{(f)},\phi^{(i)})$ does not reject the null with high probability. We shall prove that this implies $d_{2}^{2}\left(\theta^{*};\mathbb{B}_{0}[k_{0}]\right)\leq c_{\alpha,\varsigma}\sigma^{2}k_{0}/[\log(p)n]$ . From Proposition 3, we have $d_{2}^{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\leq\sigma^{2}$ . In view of Proposition 4, we have

[TABLE]

In view of Proposition 5, we have

[TABLE]

for all $q\geq c^{\prime}_{\alpha}k_{0}^{4/5}p^{1/10}$ . Finally, Proposition 2 enforces that

[TABLE]

for all $q<c^{\prime}_{\alpha}k_{0}^{4/5}p^{1/10}$ . Putting everything together, we obtain

[TABLE]

where we used that, for $q\geq k_{0}$ , $|\theta_{(k_{0}+q)}^{*}|$ is small compared to $\sigma/\sqrt{n\log(p)}$ . This concludes the proof for Case 1.
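To make the rate quoted at the beginning of Case 1 tangible, the following sketch evaluates the expression $\min[\Delta\log(p)/n,1/\sqrt{n}+k_{0}\log(p)/n]$ for given $(k_{0},\Delta,n,p)$ . It is only an illustration: constants are ignored, $\sigma$ is set to one, the function name is hypothetical, and the formula is meant for the regime $k_{0}\leq p^{1/2-\varsigma}$ considered above.

```python
import math

def squared_separation_sparse_regime(k0, Delta, n, p):
    """Evaluate min(Delta*log(p)/n, 1/sqrt(n) + k0*log(p)/n).

    Illustration only: constants are dropped and sigma = 1. The first term
    is the rate attributed to phi^(t) in the text, the second one comes
    from the combination with phi^(chi).
    """
    first_term = Delta * math.log(p) / n
    second_term = 1.0 / math.sqrt(n) + k0 * math.log(p) / n
    return min(first_term, second_term)

# Hypothetical values: with few additional coordinates the first term is
# active, whereas a large Delta makes the second term the smaller one.
small_delta_rate = squared_separation_sparse_regime(k0=5, Delta=1, n=100, p=1000)
large_delta_rate = squared_separation_sparse_regime(k0=5, Delta=1000, n=100, p=1000)
```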

Case 2: $\Delta\geq cn/\log(p/\delta)$ . In that case, $\Delta\log(p)/n$ is larger than $k_{0}\log(p)/n+n^{-1/2}$ and the first result in (20) is a consequence of the analysis of $\phi^{(\chi)}$ in Proposition 3. We now turn to the case $k_{0}\geq p^{1/2+\varsigma}$ and we need to prove that the squared separation distance is less than $c_{\alpha,\delta}\sigma^{2}k_{0}/[n\log(p)]$ . If $d_{2}^{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\geq\sigma^{2}$ , then $\phi^{(\chi)}$ rejects the null hypothesis with high probability. Thus, we can assume that $d_{2}^{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])\leq\sigma^{2}$ . Also, we can assume that $\|\widehat{\theta}_{SL}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}\leq\sigma^{2}/2$ , otherwise the test $\phi^{(ag)}$ rejects the null. Finally, we can assume that $\|\theta^{*}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}\leq\sigma^{2}/2$ , otherwise the test $\phi^{(\chi)}$ also rejects the null with high probability. By the triangle inequality, $\theta^{*}$ therefore satisfies $\|\theta^{*}-\widehat{\theta}_{SL}\|_{2}^{2}\leq 2\sigma^{2}$ and we are in position to apply Lemma 2, which implies

[TABLE]

with probability higher than $1-\alpha/2$ conditionally on $\widehat{\theta}_{SL}$ . As a consequence, the event $\mathcal{B}$ involved in the proof of Propositions 4 and 5 is true. As ensuring this event is the only place in the proof of these propositions where the restriction $\|\theta^{*}\|_{0}\leq cn/\log(p/\delta)$ is needed, we conclude that, given $\mathcal{B}$ , $\max(\phi^{(f)},\phi^{(i)})$ rejects the null with probability higher than $1-\alpha/2$ if any of the conditions (14), (15), or (19) is satisfied. Similarly, Condition (52) (with $\alpha/2$ replaced by $\alpha)$ allows us to adapt the proof of Proposition 2 without the restriction on $\|\theta^{*}\|_{0}$ . Thus, $\phi^{(t)}$ rejects the null with conditional probability higher than $1-\alpha$ under (10).

Arguing as in Case 1, we conclude that the aggregated test rejects the null with high probability if $d_{2}^{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])$ is large compared to $\sigma^{2}k_{0}/[n\log(p)]$ .

5.5 Proof of Theorem 3

Let $d$ denote any positive integer. Let $S\subset\{1,\ldots,d\}$ . For $u\in\mathbb{R}^{d}$ , we write $u_{S}=(u_{i}{\mathbf{1}}_{i\in S})_{i}$ for the vector in $\mathbb{R}^{d}$ whose values outside $S$ have been set to $0$ .

This notation is also extended to matrices. Given a positive integer $r$ and an $r\times d$ matrix $\mathbf{M}$ , we write $\mathbf{M}_{S}$ for the $r\times d$ matrix defined by $(\mathbf{M}_{S})_{i\leq r,j\leq d}=(\mathbf{M}_{i,j}{\mathbf{1}}_{\{j\in S\}})_{i\leq r,j\leq d}$ . For $R\subset\{1,\ldots,r\}$ , we also write $\mathbf{M}_{R,S}$ for the $r\times d$ -dimensional matrix such that $(\mathbf{M}_{R,S})_{i\leq r,j\leq d}=(\mathbf{M}_{i,j}\mathbf{1}\{i\in R,j\in S\})_{i\leq r,j\leq d}$ .
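The zero-padding convention for $u_{S}$ , $\mathbf{M}_{S}$ and $\mathbf{M}_{R,S}$ keeps the ambient dimensions unchanged, which differs from the more common convention of extracting sub-vectors and sub-matrices. A minimal sketch of the three operations is given below; the function names are hypothetical and indices are 0-based, unlike in the text.

```python
import numpy as np

def restrict_vector(u, S):
    """u_S: same length as u, entries outside S set to zero."""
    out = np.zeros_like(u)
    idx = sorted(S)
    out[idx] = u[idx]
    return out

def restrict_columns(M, S):
    """M_S: same shape as M, columns outside S set to zero."""
    out = np.zeros_like(M)
    idx = sorted(S)
    out[:, idx] = M[:, idx]
    return out

def restrict_block(M, R, S):
    """M_{R,S}: same shape as M, entries outside R x S set to zero."""
    out = np.zeros_like(M)
    rows, cols = sorted(R), sorted(S)
    out[np.ix_(rows, cols)] = M[np.ix_(rows, cols)]
    return out

# Small hypothetical examples (0-based indices).
u = np.array([1.0, 2.0, 3.0, 4.0])
M = np.arange(6.0).reshape(2, 3)
u_S = restrict_vector(u, {0, 2})
M_S = restrict_columns(M, {1})
M_RS = restrict_block(M, {0}, {1, 2})
```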

5.5.1 Proof of Theorem 3

Let $\delta>0$ and consider any subset $S$ satisfying the property ( ${\bf S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ ).

Lemma 9.

There exists a constant $c$ such that the following holds for all $\delta>0$ . If

[TABLE]

there exists an event $\mathcal{B}_{1}$ of probability higher than $1-\delta/2$ such that

[TABLE]

where $\lambda_{\min,S}$ and $\lambda_{\max,S}$ respectively refer to the smallest and largest eigenvalue of a matrix restricted to its coordinates in $S\times S$ .

So on the event $\mathcal{B}_{1}$ defined above, the matrix $\mathbf{X}^{(0)T}_{S}\mathbf{X}^{(0)}_{S}$ restricted to its coordinates in $S\times S$ is non-singular. Recall that the matrix $\mathbf{X}^{(0)T}_{S}\mathbf{X}^{(0)}_{S}$ is $0$ outside $S\times S$ . Nevertheless, we can define its pseudo-inverse $(\mathbf{X}^{(0)T}_{S}\mathbf{X}^{(0)}_{S})^{-1}$ by considering its inverse when restricted to $S\times S$ and fixing all its remaining entries to 0. The restricted least-squares estimator $\widehat{\theta}_{ls,S}$ is then conditionally distributed as follows

[TABLE]
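The pseudo-inverse described above (invert on $S\times S$ , pad with zeros elsewhere) satisfies the defining identities of a generalized inverse, and it yields the least-squares fit restricted to the columns in $S$ . The sketch below illustrates this on simulated data. The function names, the dimensions, the seed and the noise level are hypothetical, and the closed form $(\mathbf{X}_{S}^{T}\mathbf{X}_{S})^{-1}\mathbf{X}_{S}^{T}Y$ is the usual least-squares expression, taken here as an assumption about the definition of $\widehat{\theta}_{ls,S}$ .

```python
import numpy as np

def restricted_pinv(G, S):
    """Invert G on S x S and set all remaining entries to zero."""
    idx = sorted(S)
    out = np.zeros(G.shape)
    out[np.ix_(idx, idx)] = np.linalg.inv(G[np.ix_(idx, idx)])
    return out

def restricted_least_squares(X, Y, S):
    """Least-squares estimator supported on S, returned as a vector in R^p."""
    idx = sorted(S)
    X_S = np.zeros(X.shape)
    X_S[:, idx] = X[:, idx]          # zero-padded design, as in the text
    G = X_S.T @ X_S                  # vanishes outside S x S
    return restricted_pinv(G, S) @ X_S.T @ Y

# Simulated data with hypothetical dimensions (0-based support).
rng = np.random.default_rng(0)
m, p = 30, 6
X = rng.standard_normal((m, p))
theta_star = np.array([1.0, 0.0, -2.0, 0.0, 0.0, 0.5])
Y = X @ theta_star + 0.1 * rng.standard_normal(m)
S = {0, 2, 5}

theta_ls = restricted_least_squares(X, Y, S)

# Quantities used to check the generalized-inverse identities.
X_S = np.zeros(X.shape)
X_S[:, sorted(S)] = X[:, sorted(S)]
G = X_S.T @ X_S
P = restricted_pinv(G, S)
```

In this simulation the support of $\theta^{*}$ is contained in $S$ , so there is no bias term; when some non-zero coordinates of $\theta^{*}$ lie outside $S$ , the fit on $S$ absorbs part of their contribution.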

Define the bias $B=\theta^{*}-\operatorname{\mathbb{E}}^{(0)}[\widehat{\theta}_{ls,S}|\mathbf{X}^{(0)}]$ . On the event $\mathcal{B}_{1}$ , it follows from the definition in Equation (25) of $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ and Lemma 9 that

[TABLE]

Next, since $\widehat{\theta}_{ls,S}$ follows a normal distribution (55), we can easily bound its deviations. In particular, we deduce from (53) that there exists an event $\mathcal{B}_{2}$ of probability higher than $1-\delta/3$ such that on $\mathcal{B}_{1}\cap\mathcal{B}_{2}$ , one has

[TABLE]

Lemma 10.

Assume that $\log(6/\delta)\leq cn$ . There exists an event $\mathcal{B}_{3}$ of probability higher than $1-\delta/6$ such that on $\mathcal{B}=\cap_{i=1}^{3}\mathcal{B}_{i}$ , we have

[TABLE]

Putting everything together, we derive that, under $\mathcal{B}$ , one has

[TABLE]

This implies that, for all $i=1,\ldots,p$ ,

[TABLE]

Under the null hypothesis. Suppose that $\theta^{*}\in\mathbb{B}_{0}[k_{0}]$ . Note that (27) implies that

[TABLE]

From (58), we deduce that, conditionally on the event $\mathcal{B}$ , one has

[TABLE]

As a consequence, the test accepts the null hypothesis under the event $\mathcal{B}$ .

Under the alternative hypothesis. We now assume that $\theta^{*}$ belongs to $\mathbb{B}_{0}[k_{0}+\Delta]$ and satisfies

[TABLE]

Consider the set $T=\big{\{}i,|\theta^{*}_{i}|\geq\sigma\underline{t}\sqrt{\frac{\log(p)}{2m}}\big{\}}$ of large coordinates of $\theta^{*}$ . In view of $d_{2}^{2}(\theta^{*};\mathbb{B}_{0}[k_{0}])$ , we have

[TABLE]

On the event $\mathcal{B}$ , it follows from (58) and the definition of $\underline{c}_{*}$ that $|(\widehat{\theta}_{ls,S})_{i}|/\widehat{\sigma}_{S}\geq\underline{c}_{*}\sqrt{\log(p)/m}$ if

[TABLE]

Observe that $\underline{t}\geq 4\sqrt{2}[1+\eta\mathbf{a}_{3}^{2}M(\mathbf{a}_{1},\frac{\theta^{*}}{\sigma})\frac{\log(p)}{m}]^{1/2}\underline{c}_{*}$ . Denoting $T_{0}=T\cap\{i:\ 4|B_{i}|\geq|\theta^{*}_{i}|[1+\eta\mathbf{a}_{3}^{2}M(\mathbf{a}_{1},\frac{\theta^{*}}{\sigma})\frac{\log(p)}{m}]^{-1/2}\}$ , we obtain that $N[\underline{c}_{*};\widehat{\theta}_{ls,S}/\widehat{\sigma}_{S}]\geq|T|-|T_{0}|$ . We can bound $\|\theta^{*}_{T_{0}}\|^{2}_{2}$ in terms of the bias $\|B\|_{2}^{2}$ and then use (56) and (59).

[TABLE]

where the inequality $M[\mathbf{a}_{1},\frac{\theta^{*}}{\sigma}]\leq\Delta$ is a consequence of (59) and $\underline{c}_{*}\geq\sqrt{2}\mathbf{a}_{1}$ . In view of Equation (60), we have $\|\theta^{*}_{T_{0}}\|^{2}_{2}<d_{2}^{2}(\theta^{*}_{T},\mathbb{B}_{0}[k_{0}])$ , which implies $|T_{0}|<|T|-k_{0}$ and therefore $N[\underline{c}_{*};\widehat{\theta}_{ls,S}/\widehat{\sigma}]>k_{0}$ . The test therefore rejects the null hypothesis under the event $\mathcal{B}$ , which concludes the proof.

Proof of Lemma 9.

We first show (53). Recall that $\mathbf{X}^{(0)}$ is independent of $S$ and that the restriction of $\mathbf{N}={\boldsymbol{\Sigma}}_{S,S}^{-1/2}\mathbf{X}^{(0)T}_{S}\mathbf{X}^{(0)}_{S}{\boldsymbol{\Sigma}}_{S,S}^{-1/2}$ to $S\times S$ follows a standard Wishart distribution, all coordinates outside $S\times S$ being $0$ . By $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ , the size of the corresponding covariance matrix is less than $|S|\leq\mathbf{a}_{2}\|\theta^{*}\|_{0}$ . From e.g. [21], we deduce that, on an event $\mathcal{B}_{1-1}$ of probability larger than $1-\delta/4$ , we have

[TABLE]

where $c_{R}$ is a universal constant. Assuming that $\mathbf{a}_{3}|\theta^{*}_{0}|+\log(4/\delta)$ is small compared to $m$ , we deduce that the spectrum of $\mathbf{N}$ lies in $[1/2,2]$ . Thus, under $\mathcal{B}_{1-1}$ , the spectrum of $m^{-1}(\mathbf{X}^{(0)T}_{S}\mathbf{X}^{(0)}_{S})$ restricted to its coordinates $S\times S$ lies in $[\eta^{-1}/2,2\eta]$ .

Turning to (54), we observe that $\mathbf{X}^{(0)}_{\overline{S}}\theta^{*}_{\overline{S}}$ follows a mean zero normal distribution. Using a deviation inequality for the $\chi^{2}$ distribution (Lemma 3), we deduce the existence of an event $\mathcal{B}_{1-2}$ of probability higher than $1-\delta/4$ such that $\|\frac{1}{\sqrt{m}}\mathbf{X}^{(0)}_{\overline{S}}\theta^{*}_{\overline{S}}\|_{2}\leq\|\theta^{*}_{\overline{S}}\|_{2}\sqrt{\eta}[1+\sqrt{2\log(4/\delta)/m}]\leq\|\theta^{*}_{\overline{S}}\|_{2}\sqrt{2\eta\log(4/\delta)}$ , since $m$ is large enough as assumed in Theorem 3. So from Equation (53), we deduce that, on $\mathcal{B}_{1}=\mathcal{B}_{1-1}\cap\mathcal{B}_{1-2}$ , we have

[TABLE]

∎

Proof of Lemma 10.

$\widehat{\sigma}^{2}_{S}/\operatorname{Var}(Y|\mathbf{X}_{S})$ follows a $\chi^{2}$ distribution with $m$ degrees of freedom. Using a deviation inequality for $\chi^{2}$ distribution (Lemma 3), we derive that $\widehat{\sigma}^{2}_{S}/\operatorname{Var}(Y|\mathbf{X}_{S})\in(1/2,2)$ , with probability higher than $1-e^{-cm}\geq 1-\delta/6$ . Thus, it remains to bound $\operatorname{Var}(Y|\mathbf{X}_{S})$ . From the definition of the property $\mathbf{S}[\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3}]$ , we deduce that

[TABLE]

∎

5.5.2 Proof of Proposition 8

Write $\widehat{\theta}_{MCP,N}$ for a stationary point of the MCP criterion and let $\widehat{S}_{MCP}$ denote its support. Since we used the normalized design, we are more interested in the rescaled estimator $\widehat{\theta}_{MCP}$ defined by $(\widehat{\theta}_{MCP})_{i}=(\widehat{\theta}_{MCP,N})_{i}/\|\mathbf{X}^{(1)}_{.,i}\|_{2}$ . As explained in the proof of Lemma 10, the design matrix $\mathbf{T}$ satisfies, with probability higher than $1-1/p$ , the compatibility property (see [36, 42]) with any set of size less than $n/[c^{(1)}_{\eta}\log(p)]$ , see [40]. Besides, with probability higher than $1-1/p$ , the restricted eigenvalues for sparsities of size less than $n/[c^{(1)}_{\eta}\log(p)]$ are bounded by constants depending on $\eta$ , see [21, 49]. From Lemma 1, we deduce that $\widehat{\sigma}_{SL}/\sigma\in(3/4,5/4)$ with probability higher than $1-1/p$ . We are therefore in position to apply Theorem 6 in [49] and Corollary 1 in Feng and Zhang [24] (our definition of MCP uses a different normalization from that in [49] and [24], so their results have to be translated to our setting), provided that we choose the constant $\underline{c}^{(MCP)}_{\eta}$ large enough and $\underline{c}^{{}^{\prime}(MCP)}_{\eta}$ small enough. From Theorem 6 in [49] with $B=S^{*}$ (the support of $\theta^{*}$ ), we deduce that, with probability higher than $1-1/p$ ,

[TABLE]

Write $\widehat{\theta}^{(1)}_{ls,S^{*}}$ for the least-squares estimator of $\theta^{*}$ restricted to $S^{*}$ : $\widehat{\theta}^{(1)}_{ls,S^{*}}=\operatorname*{arg\,min}_{\theta\ :\,\mathcal{S}(\theta)\subset S^{*}}\|Y^{(1)}-\mathbf{X}^{(1)}\theta\|_{2}^{2}$ as defined in Equation (26). We deduce from Corollary 1 in [24] that

[TABLE]

The restricted least-squares estimator $\widehat{\theta}^{(1)}_{ls,S^{*}}$ follows a normal distribution with mean $\theta^{*}$ and covariance $(\mathbf{X}_{S^{*}}^{(1)T}\mathbf{X}_{S^{*}}^{(1)})^{-1}$ where we consider here the pseudo-inverse. The eigenvalues of $m(\mathbf{X}_{S^{*}}^{(1)T}\mathbf{X}_{S^{*}}^{(1)})^{-1}$ are bounded by the restricted eigenvalue condition on the design $\mathbf{X}^{(1)}$ . Hence, we obtain $\|\widehat{\theta}^{(1)}_{ls,S^{*}}-\theta^{*}\|_{\infty}\leq c^{\prime\prime\prime}_{\eta}\sigma\sqrt{\log(p)/m}$ with probability higher than $1-c/p$ , for some $c^{\prime\prime\prime}_{\eta}>0$ . This implies that $|\theta^{*}_{i}|\leq 2|(\widehat{\theta}^{(1)}_{ls,S^{*}})_{i}|$ if $|\theta^{*}_{i}|\geq 2c^{\prime\prime\prime}_{\eta}\sigma\sqrt{\log(p)/m}$ . We obtain

[TABLE]

The result follows.

5.5.3 Proof of Theorem 4

To alleviate the notation, we simply write $S_{t}$ for $\widehat{S}_{t}$ in this proof. For a random vector $X\sim\mathcal{N}(0,{\boldsymbol{\Sigma}})$ , we write ${\boldsymbol{\Sigma}}^{(t)}$ for the conditional variance of $X$ given $(X_{S_{t-1}},S_{t-1})$ . Standard computations for conditional variance based on Schur complement lead to

[TABLE]

where $({\boldsymbol{\Sigma}}_{S_{t-1},S_{t-1}})^{-1}$ is the pseudo-inverse of ${\boldsymbol{\Sigma}}_{S_{t-1},S_{t-1}}$ obtained by considering its inverse when restricted to $S_{t-1}\times S_{t-1}$ and setting all its remaining entries to [math].

In the sequel, we denote $Y_{\perp}^{(t)}=\underline{\boldsymbol{\Pi}}^{\perp}_{t,S_{t-1}}Y^{(t)}$ and $\mathbf{X}_{\perp}^{(t)}=\underline{\boldsymbol{\Pi}}^{\perp}_{t,S_{t-1}}\mathbf{X}^{(t)}$ . The following lemma ensures that the linear regression of $Y^{(t)}$ on $\mathbf{X}_{\perp}^{(t)}$ involves the restriction of $\theta^{*}$ to $\overline{S}_{t-1}$ .

Lemma 11.

Fix any $t\in[1;T]$ and consider the event such that $|S_{t-1}|<m/T$ . Then, given $S_{t-1}$ , the rows of $\mathbf{X}_{\perp}^{(t)}$ are independent and follow a centered normal distribution with covariance matrix ${\boldsymbol{\Sigma}}^{(t)}$ . Besides, we have

[TABLE]

The next lemma ensures that the population covariance matrix of the projected design still belongs to $\mathcal{U}[\eta]$ .

Lemma 12.

For any ${\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)$ and any set $S_{t-1}$ , the restriction of ${\boldsymbol{\Sigma}}^{(t)}$ to $\overline{S}_{t-1}\times\overline{S}_{t-1}$ belongs to $\mathcal{U}(\eta)$ .
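Lemma 12 can be illustrated numerically. The sketch below assumes that ${\boldsymbol{\Sigma}}^{(t)}$ is the usual Schur-complement expression ${\boldsymbol{\Sigma}}-{\boldsymbol{\Sigma}}_{.,S}({\boldsymbol{\Sigma}}_{S,S})^{-1}{\boldsymbol{\Sigma}}_{S,.}$ with the zero-padded pseudo-inverse, and that membership in $\mathcal{U}(\eta)$ is a two-sided bound on the spectrum. The function name, the dimension and the seed are hypothetical. The check is that the spectrum of the conditional covariance restricted to $\overline{S}\times\overline{S}$ stays within the range of the spectrum of ${\boldsymbol{\Sigma}}$ .

```python
import numpy as np

def conditional_covariance(Sigma, S):
    """Covariance of X given X_S for X ~ N(0, Sigma), zero on rows/cols in S."""
    idx = sorted(S)
    p = Sigma.shape[0]
    # Pseudo-inverse of Sigma_{S,S}: inverse on S x S, zero elsewhere.
    inv_SS = np.zeros((p, p))
    inv_SS[np.ix_(idx, idx)] = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    # Sigma with the columns outside S set to zero.
    Sigma_cols_S = np.zeros((p, p))
    Sigma_cols_S[:, idx] = Sigma[:, idx]
    return Sigma - Sigma_cols_S @ inv_SS @ Sigma_cols_S.T

# Hypothetical well-conditioned covariance matrix and conditioning set.
rng = np.random.default_rng(1)
p = 8
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + 0.5 * np.eye(p)
S = {1, 4, 6}

Sigma_t = conditional_covariance(Sigma, S)
complement = sorted(set(range(p)) - S)
spectrum_full = np.linalg.eigvalsh(Sigma)
spectrum_cond = np.linalg.eigvalsh(Sigma_t[np.ix_(complement, complement)])
```

The containment follows from the fact that the inverse of the Schur complement is a principal submatrix of ${\boldsymbol{\Sigma}}^{-1}$ , whose eigenvalues interlace with those of ${\boldsymbol{\Sigma}}^{-1}$ .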

Denote $\delta=p^{-2}$ . For $1\leq t\leq T$ , Property ( $\mathbf{Q}_{t}$ ) is said to be satisfied if there exists an event $\xi_{t}$ measurable with respect to $((\underline{Y}^{(1)},\underline{\mathbf{X}}^{(1)}),\ldots,(\underline{Y}^{(t)},\underline{\mathbf{X}}^{(t)}))$ of probability higher than $(1-\delta)^{t}$ such that the three following inequalities hold:

[TABLE]

Assume that the property $\mathbf{Q}_{T}$ holds and recall that $\widehat{S}=S_{T}$ . Since $|\mathcal{S}(\theta^{*})\setminus S_{T}|$ is an integer, $T\geq\log_{2}(n)$ , and $\|\theta^{*}\|_{0}\leq n/4$ , there exists an event of probability larger than $(1-\delta)^{T}$ such that

[TABLE]

which, with $\delta=p^{-2}$ , is the result of Theorem 4. Thus, it suffices to prove $(\mathbf{Q}_{t})$ by induction.

Lemma 13.

Assume that ${\boldsymbol{\Sigma}}\in\mathcal{U}(\eta)$ with $\eta>0$ . Assume that $|S_{t}|$ is such that

[TABLE]

(Recall that $\underline{c}_{\eta}^{(SL),2}$ is introduced in Lemma 1). Then, given $S_{t}$ , there exists an event $\mathcal{F}_{t+1}$ measurable with respect to $(\underline{Y}^{(t+1)},\underline{\mathbf{X}}^{(t+1)})$ of probability higher than $1-\delta$ such that

[TABLE]

Step $(\mathbf{Q}_{1})$ :

Recall that $S_{0}=\emptyset$ . By Lemma 13 and Equation (30), there exists an event $\mathcal{E}_{1}$ with probability higher than $1-\delta$ such that

[TABLE]

Counting the components of $\theta^{*}_{\overline{S}_{1}}$ that are larger (in absolute value) than $2\sigma\sqrt{c_{\eta}T\log(p/\delta)/m}$ , we derive that

[TABLE]

which, together with the previous bound implies

[TABLE]

So $(\mathbf{Q}_{1})$ holds.

Induction step:

Assume that $(\mathbf{Q}_{t-1})$ holds for some $T-1\geq t\geq 1$ . By $(\mathbf{Q}_{t-1})$ and on $\xi_{t-1}$ , we have that $|S_{t-1}|\leq 2(t-1)\|\theta^{*}\|_{0}\leq m/(2T)$ by Condition (30). Thus, $m/T-|S_{t-1}|$ is large enough and we can apply Lemma 13. As a consequence, there exists an event $\mathcal{E}_{t}$ of probability higher than $(1-\eta)^{t}$ such that

[TABLE]

Together with $(\mathbf{Q}_{(t-1)})$ , this implies $|S_{t}|\leq 2(t-1)\|\theta^{*}\|_{0}+2\|\theta^{*}\|_{0}=2t\|\theta^{*}\|_{0}$ . As for the proof of $(\mathbf{Q}_{1})$ , we lower bound $\|\theta^{*}_{\overline{S}_{t}}\|_{2}^{2}$ by considering separately the entries larger (in absolute value) than $2\sigma\sqrt{c_{\eta}T\log(p/\delta)/m}$ . This leads us to

[TABLE]

where we used $(\mathbf{Q}_{t-1})$ in the second line. We have proved $(\mathbf{Q}_{t})$ . This concludes the proof.

Proof of Lemma 11.

To alleviate the notation, we simply write $S$ for $S_{t-1}$ , $\hat{S}$ for $\hat{S}^{(ith)}$ , $\mathbf{X}$ (resp. $Y$ ) for $\underline{\mathbf{X}}^{(t)}$ (resp. $\underline{Y}^{(t)}$ ), $\operatorname{\mathbb{E}}$ for the expectation $\operatorname{\mathbb{E}}^{(t)}$ , and $\underline{\boldsymbol{\Pi}}_{S}^{\perp}$ for $\underline{\boldsymbol{\Pi}}_{t,S}^{\perp}$ (in the proof of this lemma only). Besides, since $S$ has been built based on independent samples, we consider it as fixed. Also, without loss of generality, we assume that $S=\{1,\ldots,|S|\}$ .

Define $\mathbf{Z}=\mathbf{X}-\mathbb{E}[\mathbf{X}|\mathbf{X}_{S}]$ . Since $\mathbf{X}$ follows a normal distribution, $\mathbf{Z}$ is independent of $\mathbf{X}_{S}$ . Besides, the rows of $\mathbf{Z}$ are i.i.d. and follow a centered normal distribution with covariance matrix ${\boldsymbol{\Sigma}}^{(t)}$ . Since the rows of $\mathbf{X}$ are i.i.d., each column of $\mathbb{E}[\mathbf{X}|\mathbf{X}_{S}]$ is a linear combination of the columns of $\mathbf{X}_{S}$ . As a consequence, there exists a $|S|\times p$ matrix $\mathbf{R}$ such that $\mathbb{E}[\mathbf{X}|\mathbf{X}_{S}]=\mathbf{X}_{S}\mathbf{R}$ .

Since $T|S|<m$ and since ${\boldsymbol{\Sigma}}$ is invertible, the rank of $V[S,\mathbf{X}]$ equals $|S|$ almost surely. As a consequence, applying the orthogonal projection along $V[S,\mathbf{X}]$ to $\mathbf{X}$ leads to

[TABLE]

Since the rows of $\mathbf{Z}$ are i.i.d. with covariance ${\boldsymbol{\Sigma}}^{(t)}$ , there exists a matrix $\mathbf{U}$ with i.i.d. standard normal entries such that $\mathbf{Z}=\mathbf{U}{\boldsymbol{\Gamma}}$ where ${\boldsymbol{\Gamma}}$ is a square root of ${\boldsymbol{\Sigma}}^{(t)}$ . As a consequence, $\underline{\boldsymbol{\Pi}}_{S}^{\perp}\mathbf{X}=\underline{\boldsymbol{\Pi}}_{S}^{\perp}\mathbf{U}{\boldsymbol{\Gamma}}$ . Since $\mathbf{X}_{S}$ is independent of $\mathbf{U}$ , it follows that, given $\mathbf{X}_{S}$ , the $(m/T-|S|)\times p$ matrix $\underline{\boldsymbol{\Pi}}_{S}^{\perp}\mathbf{U}$ is made of independent standard normal entries and the rows of $\mathbf{X}_{\perp}^{(t)}=\underline{\boldsymbol{\Pi}}_{S}^{\perp}\mathbf{X}$ therefore follow independent normal distributions with covariance matrix ${\boldsymbol{\Sigma}}^{(t)}$ .
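The key algebraic step of this proof is that projecting orthogonally to the columns of $\mathbf{X}_{S}$ removes the term $\mathbf{X}_{S}\mathbf{R}$ , so that the projected design coincides with the projected residual matrix $\mathbf{Z}$ . The sketch below checks this identity on simulated data. It makes several assumptions for illustration: the projector is represented as a square matrix of rank equal to the number of rows minus $|S|$ rather than in reduced coordinates, $\mathbf{R}$ is taken to be the Gaussian regression coefficient $({\boldsymbol{\Sigma}}_{S,S})^{-1}{\boldsymbol{\Sigma}}_{S,.}$ , and the dimensions and the seed are hypothetical.

```python
import numpy as np

def projector_orthogonal_to(X, S):
    """Orthogonal projector onto the complement of span{X[:, j] : j in S}."""
    idx = sorted(S)
    Q, _ = np.linalg.qr(X[:, idx])  # orthonormal basis of the column span
    return np.eye(X.shape[0]) - Q @ Q.T

# Simulated Gaussian design with rows N(0, Sigma) (hypothetical sizes).
rng = np.random.default_rng(2)
rows, p = 20, 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + 0.5 * np.eye(p)
X = rng.standard_normal((rows, p)) @ np.linalg.cholesky(Sigma).T

S = {0, 1}
idx = sorted(S)

# E[X | X_S] = X_S R with R = Sigma_{S,S}^{-1} Sigma_{S,.} (|S| x p).
R = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, :])
Z = X - X[:, idx] @ R

Pi_perp = projector_orthogonal_to(X, S)
X_perp = Pi_perp @ X
```

Because the identity $\underline{\boldsymbol{\Pi}}_{S}^{\perp}\mathbf{X}_{S}=0$ holds for any matrix $\mathbf{R}$ , the specific form of $\mathbf{R}$ only matters for the independence between $\mathbf{Z}$ and $\mathbf{X}_{S}$ , not for the projection step itself.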

Also we have

[TABLE]

since the columns of $\mathbf{X}_{\perp}^{(t)}$ in $S$ are equal to zero. Given $\mathbf{X}_{S}$ , $\underline{\boldsymbol{\Pi}}_{S}^{\perp}\epsilon$ is the projection of a standard normal vector onto a subspace of dimension $m/T-|S|$ . As a consequence, $\underline{\boldsymbol{\Pi}}_{S}^{\perp}\epsilon$ follows a normal distribution with covariance matrix $\mathbf{I}_{m/T-|S|}$ and is independent of $\mathbf{X}$ . The result follows. ∎

Proof of Lemma 12.

For simplicity, we write $S$ for $S_{t-1}$ . Let $u$ be a normed vector supported in $\overline{S}$ . We shall prove that $u^{T}{\boldsymbol{\Sigma}}^{(t)}u$ belongs to $(1/\eta,\eta)$ . Consider a random vector $X\sim\mathcal{N}(0,{\boldsymbol{\Sigma}})$ so that $u^{T}{\boldsymbol{\Sigma}}^{(t)}u=\operatorname{Var}\left(u^{T}X|X_{S}\right)$ . Consider the $|S|+1$ size covariance matrix ${\boldsymbol{\Gamma}}$ of $((X_{i})_{i\in S},u^{T}X)$ . Then, ${\boldsymbol{\Gamma}}\in\mathcal{U}(\eta)$ and $\operatorname{Var}\left(u^{T}X|\mathbf{X}_{S}\right)=1/({\boldsymbol{\Gamma}}^{-1}_{|S|+1,|S|+1})$ , which therefore lies in $(1/\eta,\eta)$ . ∎
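The identity $\operatorname{Var}(u^{T}X|X_{S})=1/({\boldsymbol{\Gamma}}^{-1})_{|S|+1,|S|+1}$ is the usual Schur complement formula. The following sketch checks it on a randomly generated covariance matrix, together with the eigenvalue sandwich that gives the final inclusion when the spectrum of ${\boldsymbol{\Gamma}}$ is controlled. The matrix, its size and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                                      # plays the role of |S| + 1 (illustrative)
A = rng.standard_normal((d, d))
Gamma = A @ A.T + d * np.eye(d)            # a well-conditioned covariance matrix

# Conditional variance of the last coordinate given the first d - 1 (Schur complement).
G_SS, g = Gamma[:-1, :-1], Gamma[:-1, -1]
cond_var = Gamma[-1, -1] - g @ np.linalg.solve(G_SS, g)

# Same quantity read off the precision matrix.
inv_last = 1.0 / np.linalg.inv(Gamma)[-1, -1]

# Eigenvalues of Gamma in increasing order: cond_var lies between the extremes.
eigs = np.linalg.eigvalsh(Gamma)
```

Since $({\boldsymbol{\Gamma}}^{-1})_{dd}$ lies between the inverses of the largest and smallest eigenvalues of ${\boldsymbol{\Gamma}}$, the conditional variance lies between `eigs[0]` and `eigs[-1]`.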

Proof of Lemma 13.

To alleviate the notation, we simply write $\widehat{\theta}_{(SL)}$ and $\widehat{\theta}_{(SL,t)}$ for $\widehat{\theta}_{(SL)}\big{[}\underline{Y}_{\perp}^{(t+1)},\underline{\mathbf{X}}_{\perp}^{(t+1)}\big{]}$ and $\widehat{\theta}_{(SL,t)}\big{[}\underline{Y}_{\perp}^{(t+1)},\underline{\mathbf{X}}_{\perp}^{(t+1)}\big{]}$ respectively. Recall that $S_{t+1}=S_{t}\cup\mathcal{S}(\widehat{\theta}_{(SL,t)})$. The columns of $\underline{\mathbf{X}}_{\perp}^{(t+1)}$ corresponding to indices in $S_{t}$ are null. Therefore, $\widehat{\theta}_{(SL)}\big{[}\underline{Y}_{\perp}^{(t+1)},\underline{\mathbf{X}}_{\perp}^{(t+1)}\big{]}$ is a square-root Lasso estimator of $\underline{Y}_{\perp}^{(t+1)}$ given the restriction of $\underline{\mathbf{X}}_{\perp}^{(t+1)}$ to the columns in $\overline{S_{t}}$. In view of Lemmas 11 and 12, we can apply Lemma 1. Thus, given $S_{t}$, there exists an event $\mathcal{F}_{t}$ of probability higher than $1-\delta$ such that

[TABLE]

By assumption, $m/T\geq 2|S_{t-1}|$ . Since $\widehat{\theta}_{(SL,t)}$ is a hard thresholded modification of $\widehat{\theta}_{(SL)}$ at level

[TABLE]

its entry-wise error increases only at the non-zero entries of $\theta^{*}_{\overline{S}_{t}}$ and at most by $10/3\sigma\sqrt{\underline{c}_{\eta}^{(SL)}\frac{T}{m}\log(\tfrac{p}{\delta})}$ . This implies that

[TABLE]

Recall that $(S_{t+1}\setminus S_{t})$ is the support of $\widehat{\theta}_{(SL,t)}$ . Each non-zero entry of $\widehat{\theta}_{(SL,t)}$ is equal to that of $\widehat{\theta}_{(SL)}$ . As a consequence, each index in the support of $\widehat{\theta}_{(SL,t)}$ and outside the support of $\theta^{*}$ contributes at least by $2\sigma^{2}\underline{c}_{\eta}^{(SL)}\frac{T}{m}\log(\frac{p}{\delta})$ in the loss $\|\theta^{*}_{\overline{S}_{t}}-\widehat{\theta}_{(SL)}\|_{2}^{2}$ . This implies

[TABLE]

which in view of (63) leads us to $|S_{t+1}\setminus\mathcal{S}(\theta^{*}_{\overline{S}_{t}})|\leq\|\theta^{*}_{\overline{S}_{t}}\|_{0}$ and

[TABLE]

which concludes the proof.

∎
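The hard-thresholding argument in the proof of Lemma 13 rests on an elementary entrywise fact: thresholding at level $\tau$ never increases the error at a coordinate where $\theta^{*}$ vanishes, and increases it by at most $\tau$ elsewhere. The sketch below checks this on simulated vectors. The dimension, the threshold `tau` and the noisy pilot estimator are hypothetical stand-ins, not the square-root Lasso estimator or the threshold of the paper.

```python
import numpy as np

def hard_threshold(v, tau):
    """Keep entries with |v_i| > tau, set the others to zero."""
    return np.where(np.abs(v) > tau, v, 0.0)

rng = np.random.default_rng(2)
p, tau = 200, 0.5                          # illustrative dimension and threshold
theta_star = np.zeros(p)
theta_star[:10] = rng.uniform(-2, 2, size=10)
theta_hat = theta_star + 0.3 * rng.standard_normal(p)   # stand-in for a pilot estimator

theta_thr = hard_threshold(theta_hat, tau)
err_before = np.abs(theta_hat - theta_star)
err_after = np.abs(theta_thr - theta_star)

on_support = theta_star != 0
# Off the support of theta_star the entrywise error cannot increase;
# on the support it increases by at most tau.
worst_off = (err_after - err_before)[~on_support].max()
worst_on = (err_after - err_before)[on_support].max()
```

The second property follows from $|\theta^{*}_{i}|\leq|\widehat{\theta}_{i}-\theta^{*}_{i}|+|\widehat{\theta}_{i}|$ whenever $|\widehat{\theta}_{i}|\leq\tau$.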

6 Proofs of the minimax lower bounds

We first state the following classical lemma that links the total variation distance with the performance of a test with composite hypotheses. Some variants of it may be found in textbooks such as [43]. For the sake of completeness, we provide a proof below.

Lemma 14.

Consider a parametric model $\{\operatorname{\mathbb{P}}_{\theta},\,\theta\in\Theta\}$ and two subsets $\Theta_{0}\subset\Theta,\Theta_{1}\subset\Theta$ . Let $\mu_{0}$ and $\mu_{1}$ be any probability measures on $\Theta$ . Denote ${\bf P}_{\mu_{i}}=\int\operatorname{\mathbb{P}}_{\theta}\mu_{i}(d\theta)$ for $i=0,1$ . Any test $\phi$ of $\Theta_{0}$ against $\Theta_{1}$ satisfies

[TABLE]

Proof of Lemma 14.

For $i=0,1$ , define the probability measure $\mu^{\prime}_{i}$ by $\mu^{\prime}_{i}[A]=\mu_{i}[A\cap\Theta_{i}]/\mu_{i}[\Theta_{i}]$ for any event $A$ . Given $\mu^{\prime}_{i}$ , let ${\bf P}^{\prime}_{\mu_{i}}=\int\operatorname{\mathbb{P}}_{\theta}\mu^{\prime}_{i}(d\theta)$ . It follows from Le Cam’s arguments that

[TABLE]

By the triangle inequality, one has

[TABLE]

Obviously, the total variation distance $\|\mu^{\prime}_{0}-\mu_{0}\|_{TV}$ equals $\mu^{\prime}_{0}[\Theta_{0}]-\mu_{0}[\Theta_{0}]=\mu_{0}[\overline{\Theta_{0}}]$ .

[TABLE]

Arguing similarly for $\|{\bf P}^{\prime}_{\mu_{1}}-{\bf P}_{\mu_{1}}\|_{TV}$ and plugging these bounds into (65) concludes the proof.

∎
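The identity $\|\mu^{\prime}_{0}-\mu_{0}\|_{TV}=\mu_{0}[\overline{\Theta_{0}}]$ used in this proof is easy to verify on a finite parameter set. The prior and the set $\Theta_{0}$ in the sketch below are toy choices made for illustration only.

```python
import numpy as np

def tv(a, b):
    """Total variation distance sup_A |a(A) - b(A)| between two pmfs on a finite set."""
    return 0.5 * np.abs(np.asarray(a) - np.asarray(b)).sum()

mu0 = np.array([0.4, 0.3, 0.2, 0.1])               # illustrative prior on four points
in_theta0 = np.array([True, True, True, False])    # Theta_0 = first three points

# mu'_0: the prior mu_0 conditioned on Theta_0.
mu0_cond = np.where(in_theta0, mu0, 0.0) / mu0[in_theta0].sum()
mass_outside = mu0[~in_theta0].sum()               # mu_0 of the complement of Theta_0
tv_value = tv(mu0_cond, mu0)
```

Here `tv_value` coincides with `mass_outside`, so a prior that puts almost all its mass on $\Theta_{0}$ can be replaced by its conditioned version at a small cost in total variation.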

6.1 Proof of Proposition 1

Proof of Proposition 1.

Intuitively, testing the sparsity for $k_{0}\geq n$ is impossible because $\theta^{*}$ cannot even be recovered in the noiseless setting ($\sigma=0$) when it contains more than $n$ non-zero entries. As the design matrix $\mathbf{X}$ is random, this argument needs to be slightly refined. Without loss of generality, we consider the case $p=n+1$, $k_{0}=n$ and $\Delta=1$. Let us write $\underline{\mathbf{X}}$ for the submatrix of $\mathbf{X}$ made of its first $n$ columns. In order to apply Lemma 14, we shall build two suitable prior distributions on the sets of $n$-sparse and $(n+1)$-sparse vectors.

With probability one, the square matrix $\underline{\mathbf{X}}$ is invertible. Denote by $s_{\min}$ (resp. $s_{\max}$) the smallest (resp. largest) singular value of $\underline{\mathbf{X}}$. Fix any $\delta\in(0,1)$. As stated for instance in [41], there exist $c_{-}(n,\delta)=c_{-}>0$ and $c_{+}(n,\delta)=c_{+}>0$ such that the following holds

[TABLE]

where $\mathbb{P}_{\mathbf{X}}$ stands for the distribution of $\mathbf{X}$. Here, $\mathbf{X}_{.,p}$ stands for the $p$-th column of $\mathbf{X}$. Although the exact expression of $c_{-}$ and $c_{+}$ is not relevant in this proof, these two quantities are of the order $n^{-1/2}$ and $n^{1/2}$. We call $\mathcal{A}$ the event defined in the above probability bound.

Let $\mu_{0}$ stand for the centered Gaussian measure in $\mathbb{R}^{n+1}$ with covariance matrix $\big{(}\begin{array}[]{cc}\mathbf{I}_{n}&0\\ 0&0\end{array}\big{)}$ . We write ${\bf P}_{0,0}=\int_{\mathbb{R}^{n+1}}\operatorname{\mathbb{P}}_{\theta,0}\mu_{0}(d\theta)$ . Given any $r>0$ , define the vector $v_{r}=(0,\ldots,0,r)^{T}$ . We fix ${\bf P}_{1,r,0}=\int_{\mathbb{R}^{n+1}}\operatorname{\mathbb{P}}_{\theta+v_{r},0}\mu_{0}(d\theta)$ . We argue that, for $r$ small enough, the total variation distance $\|{\bf P}_{0,0}-{\bf P}_{1,r,0}\|_{TV}$ is smaller than $2\delta$ .

Under $\mathbf{P}_{0,0}$ , for a fixed $\mathbf{X}$ , it holds that $Y\sim\mathcal{N}(0,\underline{\mathbf{X}}\underline{\mathbf{X}}^{T})$ whereas, under $\mathbf{P}_{1,r,0}$ , it holds that $Y\sim\mathcal{N}(\mathbf{X}v_{r},\underline{\mathbf{X}}\underline{\mathbf{X}}^{T})$ . When $\underline{\mathbf{X}}$ satisfies $\mathcal{A}$ , these two covariance matrices are invertible with eigenvalues in $(c_{-}^{2},c_{+}^{2})$ and $\|\mathbf{X}v_{r}\|_{2}\leq rc_{+}$ . Thus, for $r$ going to zero, the total variation distance between these conditional distributions goes to zero uniformly over all $\mathbf{X}$ satisfying $\mathcal{A}$ . In particular, there exists some $r_{0}$ such that these distances are uniformly smaller than $\delta$ . Since $\operatorname{\mathbb{P}}(\mathcal{A})\geq 1-\delta$ , it follows that

[TABLE]
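The convergence to zero of the conditional total variation distance can be made explicit, since two Gaussian distributions with a common covariance $\mathbf{C}$ and means differing by $\delta_{m}$ are at total variation distance $2\Phi(\|\mathbf{C}^{-1/2}\delta_{m}\|_{2}/2)-1$. The sketch below evaluates this closed form for a simulated design with $p=n+1$ and decreasing values of $r$. The sample size, the seed and the grid of $r$ are illustrative assumptions.

```python
import math
import numpy as np

def tv_same_cov(mean_shift, cov):
    """Exact TV distance between N(0, cov) and N(mean_shift, cov)."""
    maha = math.sqrt(mean_shift @ np.linalg.solve(cov, mean_shift))
    return math.erf(maha / (2.0 * math.sqrt(2.0)))   # equals 2 * Phi(maha / 2) - 1

rng = np.random.default_rng(3)
n = 6                                       # illustrative sample size, p = n + 1
X = rng.standard_normal((n, n + 1))
X_under = X[:, :n]                          # first n columns (invertible a.s.)
cov = X_under @ X_under.T

tvs = []
for r in [1.0, 0.1, 0.01, 0.001]:
    v_r = np.zeros(n + 1)
    v_r[-1] = r
    tvs.append(tv_same_cov(X @ v_r, cov))

# Crude bound: the Mahalanobis norm is at most r * ||X_{.,p}||_2 / s_min.
s_min = np.linalg.svd(X_under, compute_uv=False).min()
bound_small_r = math.erf(0.001 * np.linalg.norm(X[:, -1]) / (s_min * 2.0 * math.sqrt(2.0)))
```

On the event $\mathcal{A}$ the Mahalanobis norm is at most $rc_{+}/c_{-}$, which is why the bound is uniform over the designs satisfying $\mathcal{A}$.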

Consider $\sigma_{0}>0$ whose value will be fixed later. Define ${\bf P}_{0,\sigma_{0}}=\int_{\mathbb{R}^{n+1}}\operatorname{\mathbb{P}}_{\theta,\sigma_{0}}\mu_{0}(d\theta)$ and ${\bf P}_{1,r_{0},\sigma_{0}}=\int_{\mathbb{R}^{n+1}}\operatorname{\mathbb{P}}_{\theta+v_{r_{0}},\sigma_{0}}\mu_{0}(d\theta)$, the distributions associated with the linear regression models. By contraction properties of the total variation distance, one has

[TABLE]

When $\theta$ is sampled according to $\mu_{0}$, the smallest (in absolute value) entry of $\theta$ among the first $n$ entries is larger than some positive quantity $\underline{c}_{-}$ with probability larger than $1-\delta$. Let us call $\mathcal{B}$ the corresponding event. Define $\underline{\mu}$ as the measure $\mu_{0}$ conditioned on the event $\mathcal{B}$, i.e. $\underline{\mu}(\mathcal{C})={\mu}_{0}(\mathcal{C}\cap\mathcal{B})/{\mu}_{0}(\mathcal{B})$ for any measurable event $\mathcal{C}$. Then, we introduce $\underline{\bf P}_{1,r_{0},\sigma_{0}}=\int_{\mathbb{R}^{n+1}}\operatorname{\mathbb{P}}_{\theta+v_{r_{0}},\sigma_{0}}\underline{\mu}(d\theta)$. By the triangle inequality, we obtain

[TABLE]

When $\theta$ is sampled according to $\underline{\mu}$, $(\theta+v_{r_{0}})$ satisfies $d_{2}(\theta+v_{r_{0}},\mathbb{B}_{0}[n])\geq\underline{c}_{-}\wedge r_{0}$. As a consequence of Lemma 14, any test of $\{\|\theta\|_{0}\leq n,\sigma=\sigma_{0}\}$ versus $\{\|\theta\|_{0}\leq n+1,d_{2}(\theta+v_{r_{0}},\mathbb{B}_{0}[n])\geq\underline{c}_{-}\wedge r_{0},\sigma=\sigma_{0}\}$ has a risk higher than $1-3\delta$. We have

[TABLE]

where $\underline{c}_{-}\wedge r_{0}$ does not depend on $\sigma_{0}$ . Taking $\sigma_{0}$ arbitrarily small leads to the desired result.

∎

6.2 Proof of Theorem 1

Given integers $k_{0}$ and $\Delta\leq p-k_{0}$ , and $\rho>0$ , we define the collection

[TABLE]

We start by a simple reduction result to narrow the range of parameters. Its proof is postponed to the end of the section.

Lemma 15.

For any $\Delta^{\prime}\leq\Delta\leq p-k_{0}$ , we have

[TABLE]

For the sake of the following bound, we make explicit the dependency of $\rho_{\gamma}^{*}[k_{0},\Delta]$ on $p$ by denoting it $\rho_{\gamma}^{*}[p,k_{0},\Delta]$. For any $k^{\prime}_{0}<k_{0}<p$ and $\Delta\leq p-k_{0}$, we have

[TABLE]

In other words, the minimax separation distance is nondecreasing with respect to $\Delta$ and, up to a change in the number $p$ of covariates, it is also nondecreasing with respect to $k_{0}$. Next, we state three lemmas whose combination implies Theorem 1.

Lemma 16.

Assume that $p\geq 2n$ . There exists a numerical constant $c>0$ such that

[TABLE]

for any $\gamma\leq 0.53$ , all $k_{0}\leq n$ and $1\leq\Delta\leq p-k_{0}$ .

Proof of Lemma 16.

This lemma is a consequence of known signal detection lower bounds ($k_{0}=0$). For instance, it is proved in [46, Sect. 9.1] that

[TABLE]

for all $1\leq\Delta\leq p$ . Since $p\geq 2n$ and $k_{0}\leq n$ , Lemma 15 entails that

[TABLE]

which concludes the proof. ∎

Lemma 17.

Assume that $p\geq 2n$ . There exist constants $c_{1}$ – $c_{5}$ such that the following holds for all $\gamma\leq 0.06$ :

[TABLE]

for all $k_{0}\leq n$ and $\Delta>0$ . Furthermore, if $p\geq c_{2}n^{2}$ and $k_{0}\in(c_{3}n/\log(\sqrt{p}/n),\sqrt{p}/e^{4})$ and $\Delta\leq k_{0}$ , then

[TABLE]

Proof of Lemma 17.

In the above lemma, the minimax lower bounds depend both on the size $k_{0}$ of the null hypothesis and on the size $\Delta$ of the alternative hypothesis. As a consequence, we can no longer rely directly on signal detection results as in the previous lemma. Nevertheless, we will introduce a third-party hypothesis and make use of previous signal detection lower bounds for unknown $\sigma$ [45, 46].

By Lemma 15, we assume without loss of generality that $\Delta\leq k_{0}$ . Given $\rho>0$ and $1\leq k\leq p$ , we define $\mu_{\rho,k}$ as the uniform measure over the set

[TABLE]

and the mixture measure $\mathbf{P}_{\rho,k}=\int\operatorname{\mathbb{P}}_{\theta,1}\mu_{\rho,k}(d\theta)$. As a way to derive minimax lower bounds for signal detection with unknown noise level, it is proved in [45, Theorem 4.3] and [46, Lemma 9.3] (the results in [45, 46] are actually expressed in terms of minimax separation distance, the control of the total variation distance being stated in their respective proofs) that $\|\mathbf{P}_{\rho,k}-\operatorname{\mathbb{P}}_{0,1+\rho^{2}}\|_{TV}\leq 0.47$ if $\rho\leq\rho_{k}$ or if $\rho\leq\rho^{\prime}_{k}$ with

[TABLE]

Let us now deduce (70). Since $\rho_{k}$ is increasing with respect to $k$ , we have

[TABLE]

Under $\mu_{\rho_{k_{0}},k_{0}}$ , $\theta$ is $k_{0}$ -sparse, whereas under $\mu_{\rho_{k_{0}},k_{0}+\Delta}$ , $\theta$ is $k_{0}+\Delta$ -sparse and its square distance to $\mathbb{B}_{0}[k_{0}]$ is $\Delta\rho^{2}_{k_{0}}/(k_{0}+\Delta)$ . From Lemma 14, we deduce that, for $\gamma\leq 0.06$ , one has

[TABLE]

which enforces (70) since we have $\Delta\leq k_{0}$ . Turning to (71), we observe that, under the assumptions of the lemma (and with a suitable choice of $c_{2}$ ), $k\log\big{(}\tfrac{\sqrt{p}}{e^{3/2}k}\big{)}\geq n$ both for $k=k_{0}$ and $k=k_{0}+\Delta\leq 2k_{0}$ . Arguing as above, we deduce that

[TABLE]

since the expression inside the exponential is bounded away from zero and since $e^{x}\geq 1+x$ for $x>0$ . We have proved (71). ∎
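The value $\Delta\rho^{2}_{k_{0}}/(k_{0}+\Delta)$ of the squared distance to $\mathbb{B}_{0}[k_{0}]$ used in this proof can be reproduced numerically. The sketch assumes, as that value suggests, that $\mu_{\rho,k}$ is carried by vectors with exactly $k$ non-zero entries of common magnitude $\rho/\sqrt{k}$; the helper `sq_dist_to_sparse` and the numerical values are introduced for illustration only.

```python
import numpy as np

def sq_dist_to_sparse(theta, k0):
    """Squared Euclidean distance from theta to the set of k0-sparse vectors."""
    mags = np.sort(np.abs(theta))[::-1]     # magnitudes in decreasing order
    return float(np.sum(mags[k0:] ** 2))    # energy left after keeping the k0 largest

p, k0, delta, rho = 100, 12, 5, 2.0         # illustrative values
k = k0 + delta
rng = np.random.default_rng(4)
theta = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
theta[support] = rng.choice([-1.0, 1.0], size=k) * rho / np.sqrt(k)

d2 = sq_dist_to_sparse(theta, k0)
expected = delta * rho**2 / k               # Delta * rho^2 / (k0 + Delta)
```

The best $k_{0}$-sparse approximation keeps $k_{0}$ of the $k_{0}+\Delta$ equal-magnitude entries, so exactly $\Delta$ entries of squared size $\rho^{2}/(k_{0}+\Delta)$ are lost.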

The following lemma provides the key new lower bound. It corresponds to the regime where both $k_{0}$ and $\Delta$ are large. Its proof relies on more advanced arguments than the other regimes.

Lemma 18.

There exist positive numerical constants $c$ and $c_{2}$ such that the following holds for any $\gamma\leq 0.5$ and all $p\geq c_{2}$. For any $p^{1/4}\leq k_{0}\leq n$ and $\Delta\geq k_{0}^{2/3}\vee p^{1/4}$, one has

[TABLE]

Proof of Theorem 1.

First we prove (6). The case $k_{0}\leq\sqrt{p}$ is a consequence of Lemmas 16 and 17. As for the case $k_{0}\in(\sqrt{p},n)$ , we divide the analysis into several subcases. If $\Delta\leq p^{1/4}$ , it follows from Lemma 17 that $\rho_{\gamma}^{*2}[k_{0},\Delta]$ is at least of the order of $\Delta\log(p)/n$ which is larger than the lower bound in (6). For $\Delta\geq p^{1/4}\vee k_{0}^{2/3}$ we rely on Lemma 18. For $\Delta\in(p^{1/4},k_{0}^{2/3})$ , we define $k^{\prime}_{0}=\lfloor\Delta^{3/2}\rfloor$ . From the reduction (68) and Lemma 18, we derive that

[TABLE]

Finally, the lower bound (7) is a consequence of the second part of Lemma 17 together with the reduction Lemma 15.

∎

Proof of Lemma 15.

The first bound is a simple consequence of the inclusion $\mathbb{B}_{0}[k_{0},\Delta,\rho]\subset\mathbb{B}_{0}[k_{0},\Delta^{\prime},\rho]$. Let us turn to (68). Take any $\zeta>0$ arbitrarily small and define $r=\rho_{\gamma}^{*}[k_{0},\Delta]+\zeta$. There exists a test $\phi$ satisfying $R[\phi;k_{0},\Delta,r]\leq\gamma$. For any linear regression problem with $p-k_{0}+k^{\prime}_{0}$ covariates and response $Y$, we sample $k_{0}-k^{\prime}_{0}$ new independent covariates, write $\underline{\mathbf{X}}$ for the corresponding new design matrix of size $n\times(k_{0}-k^{\prime}_{0})$, and define $\underline{Y}=Y+r\underline{\mathbf{X}}1$, where $1$ is the constant vector of size $k_{0}-k^{\prime}_{0}$. Since $R[\phi;k_{0},\Delta,r]\leq\gamma$, we have

[TABLE]

implying that $\rho_{\gamma}^{*}[p-k_{0}+k^{\prime}_{0},k^{\prime}_{0},\Delta]\leq r$ . Taking the infimum over all $\zeta>0$ , we obtain (68). ∎
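The reduction used in this proof is a data augmentation: fresh covariates are appended and their contribution, with coefficient $r$ each, is added to the response. The sketch below implements this augmentation on simulated data and checks that the augmented pair follows a linear model with $k_{0}-k^{\prime}_{0}$ additional non-zero coefficients and the same noise. The helper name `augment` and all numerical values are assumptions of this example.

```python
import numpy as np

def augment(Y, X, n_extra, r, rng):
    """Append n_extra fresh standard normal covariates, each with coefficient r."""
    X_new = rng.standard_normal((X.shape[0], n_extra))
    Y_aug = Y + r * X_new @ np.ones(n_extra)
    return Y_aug, np.hstack([X, X_new])

rng = np.random.default_rng(5)
n, p_small, k0_prime, k0, r = 30, 20, 2, 5, 0.7   # p_small plays the role of p - k0 + k0'
theta = np.zeros(p_small)
theta[:k0_prime] = 1.0                            # a k0'-sparse parameter
X = rng.standard_normal((n, p_small))
noise = rng.standard_normal(n)
Y = X @ theta + noise

Y_aug, X_aug = augment(Y, X, k0 - k0_prime, r, rng)
theta_aug = np.concatenate([theta, np.full(k0 - k0_prime, r)])
residual_gap = np.abs(Y_aug - X_aug @ theta_aug - noise).max()
```

Any test designed for $p$ covariates and null sparsity $k_{0}$ can then be run on `(Y_aug, X_aug)`, which is how the smaller problem inherits its guarantees.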

Proof of Lemma 18.

Without loss of generality we assume that the noise level $\sigma$ is equal to one and we write $\operatorname{\mathbb{P}}_{\theta}$ for $\operatorname{\mathbb{P}}_{\theta,1}$ . Since the minimax separation distance $\rho_{\gamma}^{*}[k_{0},\Delta]$ is a nondecreasing function of $\Delta$ , we have $\rho_{\gamma}^{*}[k_{0},\Delta]\geq\rho_{\gamma}^{*}[k_{0},k_{0}]$ for any $\Delta>k_{0}$ . In view of (72) and since $\log(1+x)\geq x/2$ for any $x\in[0,1]$ , we only need to prove (72) for $\Delta\leq k_{0}$ .

Define $\overline{k}_{0}=k_{0}-\Delta/2$ and $\overline{k}_{1}=k_{0}+\Delta/2$ . We introduce two priors $\mu_{0}^{\otimes p}$ and $\mu_{1}^{\otimes p}$ that are almost supported on $\mathbb{B}_{0}[k_{0}]$ and $\mathbb{B}_{0}[k_{0}+\Delta]$ respectively and such that the first moments of $\mu_{0}$ and $\mu_{1}$ are matching. In Step 3 below, we show that this moment matching property ensures that the corresponding mixture distributions of $(Y,\mathbf{X})$ are close in total variation distance.

Step 1. Construction of the priors.

As in [17], we build prior measures $\mu_{0}$ and $\mu_{1}$ in such a way that their first moments are matching. Define the two quantities (where $m$ is redefined only in this proof) as follows:

[TABLE]

for some universal constant $c$ whose value will be fixed later. The following result is borrowed from [17, Lemma 3].

Lemma 19.

Given any positive and even integer $m$ and $q\in(0,1)$ , define

[TABLE]

There exist two positive and symmetric measures $\nu_{0}$ and $\nu_{1}$ whose supports lie in $[-1,-a_{m}]\cup[a_{m},1]$ satisfying:

[TABLE]

Fix $q=\overline{k}_{0}/\overline{k}_{1}$ . Then, given $m=2\lfloor 2\log(p)\rfloor$ , we consider the measures $\nu_{0}$ and $\nu_{1}$ as in Lemma 19. Given any measurable event $A$ , we define $\mu_{0}$ and $\mu_{1}$ by

[TABLE]

Here, $M.A$ stands for $\{Mx:x\in A\}$ and $\delta_{0}$ is the Dirac measure at $0$. In view of this definition, the first $m$ moments of $\mu_{0}$ and $\mu_{1}$ are matching.
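To see why matching moments matter, consider a toy pair of symmetric discrete measures whose moments agree up to order three. The measures below are illustrative and are not the measures $\nu_{0}$, $\nu_{1}$ of Lemma 19. When an exponential $e^{\omega\theta}$ is integrated against their difference, all terms of the power expansion up to order three cancel, so the gap is of order $\omega^{4}$.

```python
import math

# Two symmetric discrete measures given as (atoms, weights); purely illustrative.
mu_a = ([-1.0, 1.0], [0.5, 0.5])
mu_b = ([-2.0, 0.0, 2.0], [0.125, 0.75, 0.125])

def moment(mu, k):
    """k-th moment of a discrete measure."""
    atoms, weights = mu
    return sum(w * x**k for x, w in zip(atoms, weights))

def mgf(mu, omega):
    """Integral of exp(omega * x) against the measure."""
    atoms, weights = mu
    return sum(w * math.exp(omega * x) for x, w in zip(atoms, weights))

moment_gaps = [moment(mu_b, k) - moment(mu_a, k) for k in range(6)]
# Orders 0 to 3 cancel, so the gap between the exponential moments scales like omega^4.
ratios = [(mgf(mu_b, w) - mgf(mu_a, w)) / w**4 for w in (0.4, 0.2, 0.1)]
```

The ratios approach $(4-1)/4!=1/8$, the coefficient of the first non-matching moment. With $m$ matching moments the same cancellation leaves a remainder of order $\omega^{m+1}$, which is the mechanism exploited in Step 3.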

Step 2. Properties of the priors.

We consider the prior measures $\mu_{0}^{\otimes p}$ and $\mu_{1}^{\otimes p}$ . In view of Lemma 14, we need to show that $\mu_{0}^{\otimes p}$ is concentrated on $\mathbb{B}_{0}[k_{0}]$ and that $\mu_{1}^{\otimes p}$ is concentrated on $\mathbb{B}_{0}[k_{0},\Delta,\rho]$ for some large $\rho$ .

Under $\mu_{0}^{\otimes p}$, $\|\theta\|_{0}$ follows a binomial distribution with parameters $(p,(k_{0}-\Delta/2)/p)$. By Chebyshev’s inequality,

[TABLE]

since $\Delta\geq p^{1/4}\vee k_{0}^{2/3}$ . Similarly,

[TABLE]
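The concentration of $\|\theta\|_{0}$ can be traced with a short computation. Under $\mu_{0}^{\otimes p}$ the variance of $\|\theta\|_{0}$ is at most $k_{0}$, so Chebyshev's inequality bounds the probability of exceeding $k_{0}$ by $4k_{0}/\Delta^{2}$, which is at most $4p^{-1/8}$ when $\Delta\geq k_{0}^{2/3}\vee p^{1/4}$ because $\Delta^{2}=\Delta^{3/2}\Delta^{1/2}\geq k_{0}p^{1/8}$. The sketch below evaluates this bound at the smallest admissible $\Delta$; the constant $4$ is what this particular computation yields and is not claimed to be the constant of the displayed bound, and the values of $p$ and $k_{0}$ are illustrative.

```python
import math

def chebyshev_bound(p, k0, delta):
    """Chebyshev bound on P(N >= k0) for N ~ Binomial(p, (k0 - delta/2)/p)."""
    q = (k0 - delta / 2) / p
    variance = p * q * (1 - q)
    return variance / (delta / 2) ** 2      # P(|N - E N| >= delta/2) <= Var / (delta/2)^2

p = 10**6
results = []
for k0 in (40, 1000, 30000):
    delta = math.ceil(max(k0 ** (2 / 3), p ** 0.25))   # smallest admissible Delta
    results.append((k0, delta, chebyshev_bound(p, k0, delta), 4 * p ** (-1 / 8)))
```

Each tuple in `results` stores $k_{0}$, $\Delta$, the Chebyshev bound and the reference value $4p^{-1/8}$.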

Under the event $\|\theta\|_{0}\in(k_{0}+\Delta/4,k_{0}+3\Delta/4)$ , the corresponding parameter $\theta$ satisfies

[TABLE]

Since $\arg\cosh[1+2\overline{k}_{0}/\Delta]\leq\arg\cosh[1+2p]\leq 4\log(p)$ for $p\geq 2$ and since $\tanh(t)\geq 0.4t$ for any $t\in(0,1)$ , we deduce that

[TABLE]

since $\arg\cosh(x)\geq\log(x)$ for $x\geq 1$. As a consequence, with probability $\mu_{1}^{\otimes p}$ larger than $1-32p^{-1/8}$, $\theta$ belongs to $\mathbb{B}_{0}[k_{0},\Delta,\rho]$ with $\rho^{2}=c^{\prime}\frac{\Delta}{n\log(p)}\log^{2}[1+\frac{k_{0}}{\Delta}]$. To apply Lemma 14, it remains to bound the total variation distance between

[TABLE]
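The three elementary inequalities invoked in Step 2, namely $\arg\cosh[1+2p]\leq 4\log(p)$ for $p\geq 2$, $\tanh(t)\geq 0.4t$ on $(0,1)$ and $\arg\cosh(x)\geq\log(x)$, can be sanity-checked on a grid. This is a numerical check on illustrative grids, not a proof.

```python
import math

ps = [2, 3, 10, 10**3, 10**6, 10**9]
acosh_upper_ok = all(math.acosh(1 + 2 * p) <= 4 * math.log(p) for p in ps)

ts = [i / 1000 for i in range(1, 1000)]            # grid on (0, 1)
tanh_lower_ok = all(math.tanh(t) >= 0.4 * t for t in ts)

xs = [1 + i / 10 for i in range(0, 1000)]          # grid on [1, 100.9]
acosh_lower_ok = all(math.acosh(x) >= math.log(x) for x in xs)
```

The last inequality is immediate from $\arg\cosh(x)=\log(x+\sqrt{x^{2}-1})$, and the second from the fact that $\tanh(t)/t$ is decreasing with $\tanh(1)>0.76$.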

Step 3. Control of $\|\mathbf{P}_{0}-\mathbf{P}_{1}\|_{TV}$ .

For $j=0,\ldots,p$, define the distribution $\mathbf{P}^{(j)}_{0}=\int\operatorname{\mathbb{P}}_{\theta}\mu_{1}^{\otimes j}\otimes\mu_{0}^{\otimes p-j}(d\theta)$ with $\mathbf{P}^{(0)}_{0}=\mathbf{P}_{0}$ and $\mathbf{P}^{(p)}_{0}=\mathbf{P}_{1}$. By the triangle inequality, one has

[TABLE]
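The telescoping bound can be illustrated on product measures directly. In the sketch below the chain of hybrids is built on the priors themselves (finite toy marginals chosen for illustration) rather than on the mixture distributions $\mathbf{P}^{(j)}_{0}$ of $(Y,\mathbf{X})$, but the triangle inequality along the chain is the same.

```python
import itertools
import numpy as np

def product_pmf(marginals):
    """Joint pmf of independent coordinates, as a dict from index tuples to probabilities."""
    sizes = [len(m) for m in marginals]
    return {idx: float(np.prod([m[i] for m, i in zip(marginals, idx)]))
            for idx in itertools.product(*[range(s) for s in sizes])}

def tv(a, b):
    """Total variation distance between two pmfs stored as dicts with the same keys."""
    return 0.5 * sum(abs(a[key] - b[key]) for key in a)

mu0 = np.array([0.7, 0.2, 0.1])             # illustrative one-coordinate priors
mu1 = np.array([0.6, 0.25, 0.15])
p = 4

# Hybrid j uses mu1 on the first j coordinates and mu0 on the remaining p - j.
hybrids = [product_pmf([mu1] * j + [mu0] * (p - j)) for j in range(p + 1)]
direct = tv(hybrids[0], hybrids[p])
telescoped = sum(tv(hybrids[j], hybrids[j + 1]) for j in range(p))
single = 0.5 * np.abs(mu0 - mu1).sum()
```

Consecutive hybrids differ in a single coordinate, so each term of the sum equals `single` here. The sum is an upper bound on `direct` and is in general not tight, which is in line with the remark on the possible suboptimality of (81).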

This upper bound greatly simplifies the following computations as the distributions $\mathbf{P}^{(j)}_{0}$ and $\mathbf{P}^{(j+1)}_{0}$ only differ by one coordinate. Unfortunately, we conjecture that our minimax lower bound in Theorem 1 is suboptimal in the regime where $k_{0}$ is close to $\sqrt{p}$ precisely because of the upper bound (81). In the arguably simpler Gaussian sequence model [17], we have directly computed the $\chi^{2}$ distances between the corresponding distributions $\mathbf{P}_{0}$ and $\mathbf{P}_{1}$ to obtain the sharp separation distance in all regimes. If we use instead the decomposition (81) for the Gaussian sequence model, this leads to a suboptimal lower bound for $k_{0}$ close to $\sqrt{p}$ . To close this gap in the linear regression model, one would therefore need to directly handle the $\chi^{2}$ distance between $\mathbf{P}_{0}$ and $\mathbf{P}_{1}$ but we were not able to do it.

In the following, we shall bound independently each of these $p$ distances $\|\mathbf{P}^{(j)}_{0}-\mathbf{P}^{(j+1)}_{0}\|_{TV}$ . Interestingly, $\mathbf{P}^{(j)}_{0}$ and $\mathbf{P}^{(j+1)}_{0}$ only differ by the distribution of the $j+1$ -th coordinate of $\theta$ . The general idea is to condition with respect to all the coordinates except the $j+1$ -th one so that we consider a linear regression model with only one covariate.

Let us write $g^{(j)}_{0}(Y|\mathbf{X})$ for the conditional density of $Y$ given $\mathbf{X}$ under $\mathbf{P}^{(j)}_{0}$.

[TABLE]

Writing $\operatorname{\mathbb{E}}_{\mathbf{X}}$ for the expectation with respect to $\mathbf{X}$, we have

[TABLE]

by permutation invariance. We call $A_{j}$ this last quantity.

Given a $p$ -dimensional vector $\theta$ , let $\theta^{(-p)}$ be such that $\theta^{(-p)}_{p}=0$ and $\theta^{(-p)}_{j}=\theta_{j}$ for all $j<p$ . Write $\boldsymbol{\mu}_{j}=\mu_{0}^{\otimes j}\otimes\mu_{1}^{\otimes p-j-1}$ and $\mu_{\Delta}=\mu_{0}-\mu_{1}$ .

[TABLE]

where the quantities $\omega_{p}$ and $\xi_{p}$ are defined by

[TABLE]

Let $\Omega$ be the event such that $\Omega=\{|\omega_{p}|\leq 5\sqrt{n\log(p)},~{}\xi_{p}\leq 2n\}$ .

Fix any $\theta\in\mathbb{R}^{p}$ such that $\|\theta\|_{\infty}\leq M$. Then, under $\operatorname{\mathbb{P}}_{\theta}$, $\xi_{p}$ follows a $\chi^{2}$ distribution with $n$ degrees of freedom. As a consequence of deviation inequalities for $\chi^{2}$ distributions (Lemma 3), its probability to be larger than $2n$ is smaller than $e^{-n/16}$. Besides, conditionally on $\mathbf{X}$, $\omega_{p}$ follows a normal distribution with mean $\theta_{p}\|\mathbf{X}_{.p}\|_{2}^{2}$ and variance $\|\mathbf{X}_{.p}\|_{2}^{2}$. As a consequence, under the event $\{\xi_{p}\leq 2n\}$, the probability that $|\omega_{p}|\geq 2Mn+2\sqrt{2n\log(p)}$ is smaller than $1/p^{2}$. In view of the definition (73) of $M$ and by taking the constant $c$ in that definition small enough, we conclude that

[TABLE]

for all $\theta$ such that $\|\theta\|_{\infty}\leq M$ . We set

[TABLE]

It follows from this definition and from Equation (82) that

[TABLE]

.
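The two tail bounds used to control the event $\Omega$ can be traced as follows. This is a sketch under two assumptions: that $\xi_{p}=\|\mathbf{X}_{.p}\|_{2}^{2}$, which is consistent with its $\chi^{2}$ distribution with $n$ degrees of freedom, and that Lemma 3 is a deviation bound of the form $\mathbb{P}[\chi^{2}_{n}\geq n+2\sqrt{nx}+2x]\leq e^{-x}$.

```latex
\begin{align*}
% chi-square part: take x = n/16 in the assumed deviation bound
n+2\sqrt{n\cdot\tfrac{n}{16}}+2\cdot\tfrac{n}{16}
  &= n+\tfrac{n}{2}+\tfrac{n}{8}=\tfrac{13}{8}\,n\leq 2n
  \quad\Longrightarrow\quad
  \mathbb{P}\left[\xi_{p}\geq 2n\right]\leq e^{-n/16},\\
% Gaussian part: on {xi_p <= 2n} with |theta_p| <= M, the conditional mean is at most 2Mn
% in absolute value and the conditional standard deviation is at most sqrt(2n)
\mathbb{P}\left[|\omega_{p}|\geq 2Mn+2\sqrt{2n\log(p)}\,\middle|\,\mathbf{X}\right]
  &\leq \mathbb{P}\left[|Z|\geq 2\sqrt{\log(p)}\right]
  \leq e^{-2\log(p)}=p^{-2},
  \qquad Z\sim\mathcal{N}(0,1),
\end{align*}
```

The last step uses $\mathbb{P}[|Z|\geq t]\leq e^{-t^{2}/2}$ for $t\geq 0$. The threshold $5\sqrt{n\log(p)}$ in $\Omega$ is then reached as soon as $2Mn\leq(5-2\sqrt{2})\sqrt{n\log(p)}$, which is presumably what a small enough constant $c$ in (73) guarantees.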

In order to work out the term $A_{j,\Omega}$ , we rely on the power expansion of $e^{\omega_{p}\theta_{p}+\xi_{p}\theta_{p}^{2}}$ together with the nullity of the $m$ first moments of $\mu_{\Delta}$ .

[TABLE]

since, by (73), $m\geq 10e(5M\sqrt{n\log(p)}+nM^{2})$ if we fix $c\leq 1/(10e)$ . Plugging this bound into the definition of $A_{j,\Omega}$ , we obtain

[TABLE]

by definition (73) of $m$ . Together with (83), this implies that $A_{j}\leq 2/p^{2}+e^{-n/16}$ . Then, we use the definition of $A_{j}$ and (81) to conclude that

[TABLE]

which is smaller than $1/4$ for $p$ large enough since the assumptions of Lemma 18 enforce that $p^{1/4}\leq n$ . In view of the above bound, (78), (79), and (79), we are in position to apply Lemma 14. Thus, for $p$ large enough, we conclude that

[TABLE]

∎

6.3 Proof of Proposition 6

Since $\boldsymbol{\rho}_{g,\gamma}^{*}[k_{0},\Delta]\geq\rho_{\gamma}^{*}[k_{0},\Delta]$ , the second part of (21) comes from Theorem 1. Turning to the first part of (21), we have already pointed out in the proof of Lemma 17 that it is proved in [45] and [46] that, for all $(k,n,p)$ one has

[TABLE]

These two bounds imply that, for $p$ large enough and for all $\Delta\leq\sqrt{p}/e^{3}$ , one has

[TABLE]

Since $\boldsymbol{\rho}_{g,\gamma}^{*}[0,\Delta]$ is nondecreasing with respect to $\Delta$ (Lemma 15), the above bound is also valid for all $\Delta\leq p$ at the price of worse constants. Finally, we apply Lemma 15 together with the assumption $k_{0}\leq n\leq p/2$ to obtain the first part of (21).

Acknowledgements.

The work of A. Carpentier is partially supported by the Deutsche Forschungsgemeinschaft (DFG) Emmy Noether grant MuSyAD (CA 1488/1-1), by the DFG - 314838170, GRK 2297 MathCoRe, by the DFG GRK 2433 DAEDALUS, by the DFG CRC 1294 ’Data Assimilation’, Project A03, and by the UFA-DFH through the French-German Doktorandenkolleg CDFA 01-18. The authors thank anonymous reviewers for their helpful suggestions that improved the manuscript. The authors are also grateful to Alexandre Tsybakov and Cun-Hui Zhang for bringing to our knowledge some recent work on MCP.

Bibliography

[1] E. Arias-Castro, E. Candes, and Y. Plan. Global Testing under Sparse Alternatives: ANOVA, Multiple Comparisons and the Higher Criticism. Annals of Statistics, 39:2533–2556, 2011.
[2] Yannick Baraud. Non-asymptotic minimax rates of testing in signal detection. Bernoulli, 8(5):577–606, 2002.
[3] Yannick Baraud, Sylvie Huet, and Béatrice Laurent. Testing convex hypotheses on the mean of a Gaussian vector. Application to testing qualitative hypotheses on a regression function. Annals of Statistics, pages 214–257, 2005.
[4] A. Belloni, V. Chernozhukov, and L. Wang. Square-root Lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
[5] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities. Oxford University Press, Oxford, 2013. A nonasymptotic theory of independence, With a foreword by Michel Ledoux.
[6] Jelena Bradic, Jianqing Fan, and Yinchu Zhu. Testability of high-dimensional linear models with non-sparse structures. arXiv preprint arXiv:1802.09117, 2018.
[7] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[8] T Tony Cai and Zijian Guo. Accuracy assessment for high-dimensional linear regression. arXiv preprint arXiv:1603.03474, 2016.
