Optimal Sparsity Testing in Linear regression Model
Alexandra Carpentier, Nicolas Verzelen

TL;DR
This paper investigates the fundamental limits of testing the sparsity level in high-dimensional linear regression, providing minimax separation distances for different scenarios with known and unknown parameters.
Contribution
It precisely characterizes the minimax separation distances for sparsity testing in high-dimensional linear regression under various conditions, highlighting the influence of null and alternative sparsity levels.
Findings
Minimax separation distances depend on null and alternative sparsity levels.
Different scenarios show distinct separation distances based on knowledge of covariance and noise.
Both null and alternative hypotheses' sizes are crucial in the testing problem.
Abstract
We consider the problem of sparsity testing in the high-dimensional linear regression model. The problem is to test whether the number of non-zero components (aka the sparsity) of the regression parameter $\theta^*$ is less than or equal to $k_0$. We pinpoint the minimax separation distances for this problem, which amounts to quantifying how far a $k_1$-sparse vector $\theta^*$ has to be from the set of $k_0$-sparse vectors so that a test is able to reject the null hypothesis with high probability. Two scenarios are considered. In the independent scenario, the covariates are i.i.d. normally distributed and the noise level is known. In the general scenario, both the covariance matrix of the covariates and the noise level are unknown. Although the minimax separation distances differ in these two scenarios, both of them actually depend on $k_0$ and $k_1$, illustrating that for this composite-composite testing problem both the size of the null and of the alternative hypotheses play a key role.
1 Introduction
In the last decade, a lot of effort has been devoted to developing sound statistical methods for high-dimensional data. Most of the estimation procedures rely on the assumption that the parameter of interest has some possibly unknown structure. A prominent example is the high-dimensional linear regression problem where it is usually assumed that the regression parameter is sparse [7]. Despite the pervasiveness of the sparsity assumption in the literature, very few contributions challenge this assumption.
In this work, we tackle the largely ignored problem of assessing the sparsity of the regression parameter. Henceforth, we consider the random design high-dimensional linear regression model
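$$Y = \mathbf{X}\theta^* + \sigma\epsilon, \tag{1}$$
where the unknown parameter $\theta^*$ belongs to $\mathbb{R}^p$, the noise vector $\epsilon \in \mathbb{R}^n$ follows a standard normal distribution, $\sigma > 0$ is the noise level, and the rows of the $n \times p$ design matrix $\mathbf{X}$ are i.i.d. sampled according to the normal distribution $\mathcal{N}(0, \Sigma)$. For a given integer $k_0$, we study the problem of testing whether the vector $\theta^*$ has at most $k_0$ non-zero components.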
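For illustration only, data from the random design model can be simulated as below. This is a minimal sketch that assumes model (1) has the usual form $Y = \mathbf{X}\theta^* + \sigma\epsilon$ with Gaussian rows of covariance $\Sigma$; the function name `simulate_regression`, its arguments, and the way the non-zero entries are drawn are illustrative choices and do not come from the paper.

```python
import numpy as np

def simulate_regression(n, p, k, sigma=1.0, cov=None, seed=0):
    """Draw (X, Y, theta) from Y = X theta + sigma * eps with a k-sparse theta.

    Assumption: rows of X are i.i.d. N(0, cov); cov defaults to the identity
    (the "independent setting"). The non-zero entries of theta are drawn
    from N(0, 1), which is an arbitrary illustrative choice.
    """
    rng = np.random.default_rng(seed)
    cov = np.eye(p) if cov is None else np.asarray(cov, dtype=float)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    theta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)  # random support of size k
    theta[support] = rng.normal(size=k)
    Y = X @ theta + sigma * rng.normal(size=n)
    return X, Y, theta

X, Y, theta = simulate_regression(n=50, p=200, k=5)
print(X.shape, Y.shape, np.count_nonzero(theta))
```

Here `np.count_nonzero(theta)` is the sparsity of the simulated parameter, which is the quantity the testing problem is about.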
1.1 Minimax separation distance
Before discussing our contribution, we first formalize the sparsity testing problem. For a vector $\theta \in \mathbb{R}^p$, $\|\theta\|_0$ denotes its number of non-zero entries. Then, given a non-negative integer $k$, write $\mathbb{B}_0[k]$ for the set of $k$-sparse vectors, that is, the vectors $\theta$ with $\|\theta\|_0 \leq k$. Rephrasing our aim, we want to test whether $\theta^*$ belongs to $\mathbb{B}_0[k_0]$.
In order to assess the quality of a testing procedure, we rely on the framework of minimax separation distances [29], which is described in the following paragraphs. Let $\|\cdot\|_2$ denote the $\ell_2$ distance in $\mathbb{R}^p$. For any $\theta \in \mathbb{R}^p$, $d_2(\theta; \mathbb{B}_0[k_0]) = \inf_{u \in \mathbb{B}_0[k_0]} \|\theta - u\|_2$ stands for its distance to the set of $k_0$-sparse vectors. Intuitively, any level-$\alpha$ test of the null hypothesis $\{\theta^* \in \mathbb{B}_0[k_0]\}$ cannot reject the null with high probability when $d_2(\theta^*; \mathbb{B}_0[k_0])$ is too small. In this work, we aim at characterizing the smallest distance $\rho$ such that there exists a test achieving a small type I error probability and rejecting the null with high probability whenever $d_2(\theta^*; \mathbb{B}_0[k_0])$ is larger than $\rho$. These informal definitions are made precise in the next subsection. In the sequel, $\mathbb{P}_{\theta^*,\sigma,\Sigma}$ stands for the distribution of $(Y, \mathbf{X})$ in (1).
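The distance to the set of sparse vectors has a simple closed form, which the following sketch computes. It assumes that the distance is the Euclidean one: the closest $k_0$-sparse vector keeps the $k_0$ largest entries in absolute value, so the distance is the Euclidean norm of the remaining entries. The function name `dist_to_sparse` is illustrative.

```python
import numpy as np

def dist_to_sparse(theta, k0):
    """Euclidean distance from theta to the set of k0-sparse vectors.

    The nearest k0-sparse vector keeps the k0 largest |entries| of theta,
    so the distance is the l2 norm of the p - k0 smallest |entries|.
    """
    theta = np.asarray(theta, dtype=float)
    if k0 >= theta.size:
        return 0.0
    mags = np.sort(np.abs(theta))      # ascending order
    tail = mags[: theta.size - k0]     # entries that must be set to zero
    return float(np.sqrt(np.sum(tail ** 2)))

# Example: dropping the largest entry (4) leaves (3, 0, 1), of norm sqrt(10).
print(dist_to_sparse([3, 0, 4, 1], k0=1))
```

In particular the distance is zero exactly when the vector has at most $k_0$ non-zero entries.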
In high-dimensional linear regression, the intrinsic difficulty of estimation or testing problems sometimes depends on specific features such as the knowledge of the noise level or the knowledge of the distribution of the design. In this work, we focus on two emblematic settings. In the independent setting, we assume that the covariates are independent ($\Sigma = \mathbf{I}_p$) and that the noise level $\sigma$ is known. In the general setting, both the covariance $\Sigma$ of the covariates and the noise level $\sigma$ are unknown.
1.1.1 Independent setting
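Fix a positive integer $\Delta$. We consider the alternative hypothesis where $\theta^*$ is $(k_0+\Delta)$-sparse. Given $\rho > 0$ and a test $T$, we introduce its risk as the sum of the type I and type II error probabilities
$$R(T; k_0, \Delta, \rho) = \sup_{\theta^* \in \mathbb{B}_0[k_0]} \mathbb{P}_{\theta^*,\sigma,\mathbf{I}_p}[T = 1] + \sup_{\theta^* \in \mathbb{B}_0[k_0+\Delta],\ d_2(\theta^*; \mathbb{B}_0[k_0]) \geq \rho} \mathbb{P}_{\theta^*,\sigma,\mathbf{I}_p}[T = 0],$$
where we only consider parameters $\theta^*$ in the alternative hypothesis that lie at a distance at least $\rho$ from the null. For a fixed (known) $\sigma$ and a given $\gamma \in (0,1)$, the separation distance of $T$ is the largest $\rho$ such that its risk is higher than $\gamma$, i.e. $\rho_\gamma(T; k_0, \Delta) = \sup\{\rho > 0 : R(T; k_0, \Delta, \rho) > \gamma\}$. Parameters $\theta^*$ lying at a distance larger than $\rho_\gamma(T; k_0, \Delta)$ from the null are therefore detected with probability higher than $1-\gamma$ by $T$. Finally, the minimax separation distance is
$$\rho^*_\gamma[k_0, \Delta] = \inf_{T} \rho_\gamma(T; k_0, \Delta),$$
where the infimum is taken over all tests $T$.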
1.1.2 General setting
In the general case, neither the covariance matrix of the covariates nor the noise level is known. We only assume that the eigenvalues of are bounded away from zero and from infinity. Respectively write and for its smallest and largest eigenvalues. Given , define
[TABLE]
Fix . In this general model, the risk of a test is now taken as
[TABLE]
Since both and are unknown, we evaluate the type I and type II error probabilities uniformly over all and all . The class of covariance matrices is constrained in in order to preclude overly difficult settings where the eigenvalues of differ too much from each other. Then, as in the previous subsection, the separation distance of a test is and the minimax separation distance in the general setting is defined by
[TABLE]
In this work, we address both independent and general settings. More specifically,
- (i) We characterize the minimax separation distances in both the independent () and the general () settings by providing upper and lower bounds that match (up to a polylogarithmic loss in some regimes).
- (ii) We introduce computationally feasible testing procedures that (almost) simultaneously achieve this minimax separation distance over all .
1.2 Previous results and related literature
Before further describing our contribution, we first discuss related results in the literature.
Signal detection.
The signal detection problem which amounts to testing whether is a special instance of the sparsity testing problem (corresponding to ). Signal detection in the Gaussian vector model (which corresponds to an orthogonal design) has been extensively studied [29, 2, 23, 18, 19] in the last fifteen years. More recently, this problem has also been investigated in the random design linear regression model [28, 1, 16].
To simplify the discussion, let us consider the high-dimensional setting where for some fixed constant . Then, one can deduce from [28] that the minimax separation distance in the independent setting satisfies
[TABLE]
where means that there exist positive constants and (possibly depending on and ) such that for all , , and . For , this separation distance is achieved by measuring the raw correlations between the response and the covariates and rejecting when too many of these correlations are unusually large. This can be done through the Higher-Criticism scheme [28, 1]. For denser alternatives (), we start from the identity (and the are i.i.d.). Hence, a test rejecting when the empirical mean of is significantly larger than achieves the optimal squared separation distance of order [28, 1]. In the specific regime where is of the same order as , and is close to , the analysis has to be refined, see [16].
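The test for dense alternatives described above can be sketched as follows. Under an independent standard normal design, each response is centered normal with variance $\sigma^2 + \|\theta^*\|_2^2$, so the rescaled mean of the squared responses concentrates around $1 + \|\theta^*\|_2^2/\sigma^2$. The calibration below uses a Gaussian approximation of the chi-square distribution, which is an assumption made for this illustration and not the calibration of [28, 1]; the name `dense_detection_test` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def dense_detection_test(Y, sigma, alpha=0.05):
    """Reject {theta* = 0} when mean(Y_i^2) / sigma^2 is significantly above 1.

    Under the null with identity design covariance, sum(Y_i^2) / sigma^2 is
    chi-square with n degrees of freedom (mean n, variance 2n). The threshold
    is a normal approximation of its upper alpha-quantile (an assumption).
    """
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    stat = np.sum(Y ** 2) / (n * sigma ** 2) - 1.0
    threshold = NormalDist().inv_cdf(1.0 - alpha) * np.sqrt(2.0 / n)
    return bool(stat > threshold), float(stat)

rng = np.random.default_rng(0)
n, p = 500, 1000
X = rng.normal(size=(n, p))
theta = np.full(p, 1.0 / np.sqrt(p))       # dense signal with squared norm 1
Y = X @ theta + rng.normal(size=n)
print(dense_detection_test(Y, sigma=1.0))
```

The statistic only uses the responses, which is why it requires a known noise level and an identity design covariance.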
In the general setting (unknown and unknown ), it has been proved in [45] that,
[TABLE]
However, for sparse alternatives, the corresponding test in [45] relies on a type variable selection method and therefore has exponential computational complexity. For denser alternatives , the lower bound entails that the minimax separation distance is large whenever . Comparing the independent and the general settings, we observe that the separation distance is significantly larger in the general setting for dense alternatives .
Composite-composite testing problems and related work.
An important difference between the signal detection () problem and the general sparsity testing problem () is that, in the latter, the null hypothesis is composite, thereby making the analysis of the problem more challenging. To our knowledge, the analysis of such composite problems has been considered only in a few works [35, 3, 20, 15], although the problems of constructing adaptive confidence regions (e.g. [12, 13, 27, 39, 9, 8]) or of functional estimation (e.g. [38, 25, 14, 11, 10]) are also related to such testing problems.
In particular, Nickl and Van de Geer [39] consider the problem of constructing adaptive and honest confidence sets for in the linear regression model (1) with known variance . To achieve adaptivity to the unknown sparsity of , Nickl and van de Geer need to test hypotheses of the form . Following the so-called “infimum testing” principle, described in a systematic way in [26], they consider the statistic . This statistic corresponds to the infimum of the empirical variance when one corrects by a -sparse vector . Under the null, this statistic is not much larger than the noise level . This leads them to derive
[TABLE]
for some . Comparing this bound with its counterpart in the signal detection problem (), we observe an increase by an additive term accounting for the complexity of the null hypothesis.
To our knowledge, it is still unknown whether the upper bound of Nickl and van de Geer is optimal (that is whether actually depends on ). In this manuscript, we answer this open question for all and .
Sparsity testing in the Gaussian sequence model.
The Gaussian sequence model corresponds to the case and . In [17], we have pinpointed the minimax separation distances for all and both when is known and is unknown. In particular, the optimal separation distance actually depends on the size of the null hypothesis for large but is significantly smaller than what is obtained by infimum test strategies such as those in [26].
Generally speaking, [17] is closely related to the aims and results of this paper, but there is a significant challenge in adapting the results in [17] which are available for the Gaussian sequence setting, to the linear regression setting.
Related to this problem, some authors [11, 33, 34, 10] have considered the problem of estimating in the Gaussian sequence model in a Bayesian framework where all ’s are sampled according to some mixture distribution. Although some of the ideas can be borrowed from their work, this Bayesian setting is quite different (see [17] for a discussion).
1.3 Our results
In this paper, we characterize the minimax separation distances and . To simplify the discussion, we restrict ourselves throughout this paper to the high-dimensional regime where is an arbitrarily small absolute constant.
Independent setting.
We establish matching (up to multiplicative constants depending on ) upper and lower bounds for , for almost all values of and ; see Table 1 for a summary of these results. An aggregated test is also shown to simultaneously achieve the optimal separation distance for all , entailing that adaptation to the sparsity is possible for this problem. In our exhaustive picture of , some of the regimes in and are addressed by simple extensions of signal detection tests. However, other regimes turn out to be more challenging and require novel ideas. In what follows, we briefly mention these original aspects.
- We prove that, when , then the testing problem becomes extremely difficult, in the sense that the separation distance is very large. For , this separation distance is even infinite. This is not unexpected since identifiability problems arise in this regime.
- For moderate and large , we prove that the upper bound of Nickl and van de Geer [39] turns out to be optimal, i.e. the squared minimax separation distance is achieved by their infimum test and is of the order of . The general idea is to reduce the problem of sparsity testing (with known variance) to a detection problem with unknown variance.
- For larger (where is an arbitrarily small absolute constant, and where is an absolute constant), both upper and lower bounds are new. The lower bound is based on moment matching strategies and best polynomial approximation akin to those of [17] in the Gaussian model, but the derivation is significantly more involved in the regression setting. For small (), an optimal test is built using any estimator of achieving a small error (see e.g. [31, 51, 44, 32]). The test simply rejects when this estimator has more than unusually large entries. For denser alternatives (), the approach is quite different. We build a statistic based on the empirical Fourier transform of some correction of the raw correlations between the covariates and the responses . This approach is reminiscent of sparsity estimators in [33, 17] in the Gaussian sequence model.
General setting.
We derive lower and upper bounds of the minimax separation distance . These bounds match except in the large and regime, where there is a mismatch. See Table 2 for a summary of the results. As in the independent setting, we emphasize below the most novel ingredient of our analysis.
- Achieving the optimal squared distance could be easily done if one had access to an estimator whose distance to is less than with high probability. However, no such estimator is known for general covariance matrices . For , it is even proved that no such estimator exists [9]. Here, we first select a reasonable candidate for the support of by relying on the non-convex penalized least-squares estimator MCP [49]. Then, a test based on the restricted least-squares estimator applied to the selected subset is shown to achieve the desired separation distance. We also introduce an alternative test based on an iterative application of a projected version of the square-root Lasso.
1.4 Other related work
Two recent works [52, 30] have, among other things, considered general testing problems that encompass the sparsity testing problem. These two contributions assess the quality of their tests according to the separation distance (instead of as we do here) to the null hypothesis, i.e. . In their setting, the covariance of the covariates is unknown but its inverse is assumed to be sparse (each row of has at most non-zero entries) so that it can be reasonably well estimated. In that setting, the computationally feasible test in [52] has a small type II error probability when is much smaller than and when .
In [30], Javanmard and Lee use a test based on the debiased Lasso. It achieves a small type I error probability. Whenever is much smaller than , and also , its type II error probability is also small. Translating these results in the separation distance setting, we observe that this test achieves a squared separation distance which, in view of Table 2, is optimal for small . Their approach could be used instead of ours in their setting. However, we stress that they achieve this bound at the price of considering a much more restricted class of covariance matrices than : they need that each row of is at most sparse, while contains all matrices that have their spectrum contained in .
A recent line of work has focused on testing the nullity of a given subset of coordinates of (e.g. [53, 54, 6, 32, 44, 51, 9]), but both the settings and the methodology are quite different.
1.5 Notation
For any positive integer and , we write for the support of a vector . For and , we write for the vector in whose values outside have been set to zero. For a vector , stands for its -th largest (in absolute value) entry. Given , stands for its complement.
In the sequel, , , denote numerical positive constants that may vary from line to line. Given some quantity , stands for a positive constant possibly depending on that may vary from line to line. Underlined constants such as , do not vary in the paper.
Let be two functions that may depend on several quantities such as and let . We write (resp. ) if there exists a constant that depends only on (resp. two constants that depend only on ) such that (resp. such that ).
For , (resp. ) stands for the largest (resp. smallest) integer which is less than (resp. greater than) or equal to . Also, stands for the binary logarithm. Finally, stands for the tail distribution function of a standard normal distribution.
2 Independent setting
To simplify the notation, we denote the distribution of the data when is the identity matrix. Recall that we are especially interested in the high-dimensional setting. This is why we shall sometimes assume that or even for some arbitrarily small.
2.1 Minimax lower bound
As a starting point, we prove that, when the size of the null hypothesis is too large, consistent testing is impossible. Indeed, assume that . Then, for any such that , there exists that perfectly fits this sample () and it is therefore impossible to decipher whether is -sparse or not. The following proposition formalizes this observation.
Proposition 1.
If , then, for any , and , we have .
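The identifiability issue behind Proposition 1 can be checked numerically. The sketch below assumes that the omitted condition is that the null sparsity is at least the sample size: in that case, for almost every Gaussian design, one can pick any set of $n$ columns and solve a square linear system to obtain a vector supported on these columns that reproduces the responses exactly, whatever the true parameter is. The function name `sparse_interpolant` is illustrative.

```python
import numpy as np

def sparse_interpolant(X, Y):
    """Return a vector u with at most n non-zero entries such that X u = Y.

    Uses the first n columns of X; a square Gaussian matrix is invertible
    with probability one, so the linear system has a unique solution.
    """
    n, p = X.shape
    if p < n:
        raise ValueError("need at least as many covariates as observations")
    u = np.zeros(p)
    u[:n] = np.linalg.solve(X[:, :n], Y)
    return u

rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))
theta = rng.normal(size=p)                 # a fully dense true parameter
Y = X @ theta + rng.normal(size=n)
u = sparse_interpolant(X, Y)
print(np.count_nonzero(u), np.linalg.norm(Y - X @ u))
```

The residual norm is zero up to rounding error, so the sample alone cannot rule out an $n$-sparse parameter.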
In the sequel, we therefore restrict ourselves to the case where . The next theorem provides a lower bound for the minimax separation distance of the sparsity testing problem.
Theorem 1.
Assume that . There exist positive numerical constants – such that the following holds for all and for all . For , one has
[TABLE]
Furthermore, if and , then
[TABLE]
for all .
In particular, (7) entails that the sparsity testing problem turns out to be extremely difficult in the regime (at least when ).
The different regimes in (6) will be discussed together with the upper bounds at the end of the section. Let us shortly comment on the proof of Theorem 1. The functional is (almost) nondecreasing with respect to . As a consequence, the lower bound is a straightforward consequence of the analysis of the detection problem e.g. in [28].
The two lower bounds $\frac{1}{\sqrt{n}} + \frac{k_0}{n}\log\left[1 + \frac{\sqrt{p}}{k_0}\right]$ and (7) are based on a reduction argument. The proof stems from the fact that it is impossible to distinguish between two sets of hypotheses if these two sets are almost indistinguishable from a third-party hypothesis. Here, the third-party hypothesis corresponds to and a tailored noise variance . Plugging in minimax lower bounds for detection with unknown variance allows us to get the desired rate. See the proof for more details.
In fact, it is most challenging to prove the minimax lower bound in the regime as we cannot apply any reduction to a signal detection problem and we need to take into account that both the null and the alternative hypotheses are composite. As for the Gaussian sequence model [17], we use a general moment matching technique [38], but the non-orthogonal design matrix makes the computations trickier.
2.2 Testing procedures
In this subsection, we fix and . We now introduce three testing procedures whose combination leads to matching the previous minimax lower bound.
Without loss of generality, we assume that is divisible by and we divide the sample into three subsamples and and of equal size . For , we write for the probability according to the -th sub-sample. In fact, some of the tests introduced below only use the first two subsamples. Nevertheless, we use three subsamples throughout the paper to simplify the presentation.
To characterize the performances of the testing procedures, we shall control the type I error probability uniformly over the null hypothesis and control the type II error probability on some ’large’ parameter subset of the alternative. To simplify the statements of the results we shall refer to these two properties as (P1) and (P2) as defined below.
Property P1. A test satisfies (P1[]) if its type I error probability is less than or equal to , that is
Property P2. A test satisfies (P2[]) on a set if its type II error probability is less than or equal to uniformly on , that is
Following the discussion in the previous subsection, we restrict our attention to sparsities that are less than . This is formalized in the following condition () where are numerical constants (respectively small enough for and large enough for ) whose values are constrained in Propositions 2–5.
()
$(k_0 \vee 1)\log\left(\frac{p}{\alpha}\right) + \log^2\left(\frac{p}{\alpha}\right) \leq \underline{c}^{(\mathbf{A})} n$ and
2.2.1 Test based on an estimation of
The first test aims at detecting whether contains at least 'large' entries. In order to do so, we need to build a reasonable estimator of . Note that estimators based on the debiased Lasso have already been proved to achieve such a property (see e.g. [32]) in some settings. For the sake of completeness and as a gentle introduction to more challenging settings, we introduce here a slightly different estimator.
As a first step, we rely on a square-root Lasso [4] estimator based on the first subsample. From the design matrix , we build its column normalized modification by
[TABLE]
Set . The square-root Lasso estimator is then defined by
[TABLE]
In this section, we could replace the square-root Lasso estimator by a classical Lasso estimator since the noise level is known. Also, the design is normalized for the purpose of simplifying some proof arguments, but the results remain valid (with slightly different constants) with the unnormalized design matrix .
Then, given , we use the second sample to improve the estimation of . The estimator is based on the empirical raw correlations between the covariates and the residuals.
[TABLE]
Since the design is independent, $\widetilde{\theta}_{\mathbf{I}}$ is an unbiased estimator of $\theta^*$. It is not hard to show (see the proof of the next proposition) that, under weak assumptions, this estimator satisfies with high probability. This is why we define the test rejecting the null if $\left|(\widetilde{\theta}_{\mathbf{I}})_{(k_0+1)}\right| \geq \underline{c}^{(t)}\sigma\sqrt{\log(p/\alpha)/n}$, where a suitable value for the constant $\underline{c}^{(t)}$ is defined in the proof of Proposition 2 below. This test is powerful when $\theta^*$ contains at least $k_0+1$ large entries. This is formalized in the following proposition.
Proposition 2.
There exist numerical constants , and such that the following holds under Condition (). The test satisfies (P1[]) and (P2[]) on the collections
[TABLE]
with .
Again, we emphasize that similar performances are achieved by the debiased Lasso test of Javanmard and Lee [30].
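The two-step construction of this subsection can be sketched as follows. The sketch assumes that the second-step estimator adds to a preliminary estimate the empirical raw correlations between the covariates and the residuals on a fresh subsample, which is unbiased when the design covariance is the identity. The preliminary estimate is passed in as an argument instead of being computed by a square-root Lasso, and the constant `c_t` is a hypothetical stand-in for the constant $\underline{c}^{(t)}$, whose value is only specified in the proof.

```python
import numpy as np

def one_step_correction(X2, Y2, theta_init):
    """Correct a preliminary estimate with raw correlations on a fresh sample.

    Assumed form: theta_init + X2^T (Y2 - X2 theta_init) / m, where m is the
    size of the second subsample. Unbiased for theta* when cov(X) = identity.
    """
    m = X2.shape[0]
    return theta_init + X2.T @ (Y2 - X2 @ theta_init) / m

def count_test(theta_tilde, k0, sigma, n, alpha=0.05, c_t=4.0):
    """Reject when the (k0+1)-th largest |entry| exceeds the threshold.

    c_t is a hypothetical constant; the paper only fixes it in a proof.
    Requires k0 < len(theta_tilde).
    """
    p = theta_tilde.size
    threshold = c_t * sigma * np.sqrt(np.log(p / alpha) / n)
    sorted_abs = np.sort(np.abs(theta_tilde))[::-1]   # descending order
    return bool(sorted_abs[k0] >= threshold)

rng = np.random.default_rng(0)
n, p, k0 = 600, 50, 2
theta = np.zeros(p)
theta[:4] = 3.0                                       # four large entries
X2 = rng.normal(size=(n, p))
Y2 = X2 @ theta + rng.normal(size=n)
theta_tilde = one_step_correction(X2, Y2, np.zeros(p))
print(count_test(theta_tilde, k0, sigma=1.0, n=n))
```

With a zero preliminary estimate, the corrected estimator reduces to the raw correlations between the covariates and the responses; a better preliminary estimate reduces the variance of the correction.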
2.2.2 Test based on the norm of the residuals
The second test is also simple. We heavily rely on the knowledge of the noise level . In the detection setting , [28, 1] consider a test rejecting the null when the squared norm is large compared to one. Indeed, in expectation, is equal to . Here, we have to adapt this statistic as is unknown under the null.
First, we project the square-root Lasso estimator onto the parameter set corresponding to the null hypothesis. More precisely, we define . In other words, is obtained from by thresholding its smallest entries to zero. Then, given , we use the second sample to assess whether is significantly different from . Define the residuals vectors and, for , the statistic .
Take the threshold , we consider the test rejecting the null hypothesis when , where the numerical constant is introduced in the proof of the following proposition.
Proposition 3.
There exist numerical constants and and such that the following holds under Condition (). The test satisfies (P1[]) and (P2[]) on the collection
[TABLE]
It turns out that a combination of and is matching the minimax lower bound of Theorem 1 when . For larger null hypotheses, we need to rely on more intricate tests that are discussed in the next section.
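The projection and residual steps of this subsection can be sketched as follows. Projecting onto the null hypothesis amounts to hard thresholding, that is, keeping the $k_0$ largest entries in absolute value. The centering and scaling of the residual statistic below (squared norm of the residuals divided by the subsample size and the noise variance, minus one) is an assumption made for illustration, since the exact statistic and threshold are not reproduced here; the function names are hypothetical.

```python
import numpy as np

def project_k_sparse(theta, k0):
    """Keep the k0 largest |entries| of theta and set the others to zero."""
    theta = np.asarray(theta, dtype=float)
    out = np.zeros_like(theta)
    if k0 > 0:
        keep = np.argsort(np.abs(theta))[-k0:]   # indices of the k0 largest
        out[keep] = theta[keep]
    return out

def residual_statistic(X2, Y2, theta_proj, sigma):
    """Rescaled squared norm of the residuals on a fresh subsample.

    Assumed normalization: ||Y2 - X2 theta_proj||^2 / (m sigma^2) - 1, which
    is centered when theta_proj equals the true parameter. Large values
    suggest that no k0-sparse vector explains the responses.
    """
    m = X2.shape[0]
    R = Y2 - X2 @ theta_proj
    return float(np.sum(R ** 2) / (m * sigma ** 2) - 1.0)

print(project_k_sparse([1.0, -3.0, 2.0], k0=1))
```

With an identity design covariance, the expectation of this statistic is the squared Euclidean distance between the true parameter and the projected estimate, divided by the noise variance, which explains why the statistic tracks the distance to the null.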
2.2.3 Test based on the empirical Fourier transform of the raw covariances
In the Gaussian sequence framework ( and ), [17] have recovered the optimal separation distance using a test based on the empirical Fourier transform of the data. In this section, we adapt this approach to the linear regression model.
Conditionally to , it is shown in the proof of Proposition 4 below that the normalized raw covariances follow a normal distribution with mean and variance . Since is concentrated around and assuming that is small compared to , this implies that the raw covariances are almost distributed as a normal distribution with mean and covariance . This observation leads us to adapt the Fourier tests of [17] in our setting through raw covariances.
The purpose of the empirical Fourier transform statistic considered in [17] (but see also [33, 34] for previous work), is to approximate the discontinuous function . First, introduce, for , the function
[TABLE]
For , standard computations lead to . In particular, the function takes values in with and (see [17]). In some way, is a smooth approximation to . The larger is, the closer is to the indicator function. However, exhibits a higher variance for large .
The conditional distribution of is close to a normal distribution with mean and variance-covariance matrix , provided that is small compared to . Hence, it would be tempting to use a statistic of the form , which in expectation would be close to , which in turn would approximate . Unfortunately, large coordinates may perturb the concentration of the statistic since the true conditional covariance of is . To address this technical issue, we first correct by removing its large coefficients.
As in Subsection 2.2.1, the first two samples are respectively dedicated to building the Lasso estimator and the debiased estimator . If , then the test introduced below rejects the null hypothesis, otherwise we define as
[TABLE]
In Subsection 2.2.1, we argued that, with high probability, . As a consequence, and the support of is included in that of
Finally, we use the third subsample to compute the corrected raw covariances with relative to the linear regression model with parameter . Then, following the above heuristic explanation, we consider the statistic
[TABLE]
with tuning parameter . For in the support of , we are already confident that is non zero and we do not have to rely on . Finally, the test rejects the null when with .
In comparison to the original statistic of [17] for the Gaussian sequence model, we use here a slightly smaller tuning parameter and the threshold has an additional term .
Proposition 4.
There exist constants , , , and such that the following holds under Condition (). The test satisfies (P1) and (P2[]) on the collection of parameters satisfying , $d_2^2\left[\theta^*; \mathbb{B}_0[k_0]\right] \leq \sigma^2$ and at least one of the two following conditions.
[TABLE]
The test rejects the null hypothesis when there are many small non-zero coefficients in . In particular, if contains coefficients of order , then the null hypothesis is rejected with high probability. Note that is much smaller than the value needed to recover the position of these non-zero coefficients, which is of the order . This behavior is reminiscent of the minimax lower bound in Theorem 1, where the squared separation distance is proven to be at least of the order for .
When there are a few entries in that are neither large nor small (see below for more details), it turns out that only matches the minimax lower bound up to some multiplicative factor. To address this issue we need to introduce an additional test .
2.2.4 Intermediary regime: Test based on the empirical Fourier transform of the raw covariance
In this subsection, we focus on entries that are neither large (with respect to ) as in the analysis nor small (with respect to ) as in the analysis of . This setting turns out to be relevant for large only and we assume henceforth that . As in the previous section, we adapt a test from [17] in the Gaussian sequence setting by applying the empirical Fourier transform to the raw covariances.
Given two tuning parameters and , define the function
[TABLE]
and the statistic
[TABLE]
In order to get a grasp of this statistic let us consider the expectation of for . Simple computations (see [17]) lead to , which for large , is close to . Thus, in contrast to the population function introduced in the previous subsection, which converges to at a quadratic rate, this function converges to one at an exponential rate, thereby better handling moderate values of . The downside of using this statistic is that does not lie in .
The test is an aggregation of multiple tests based on the statistics for different tuning parameters and . Define and the dyadic collection where . Note that is not empty if and is large enough. Given any , define
[TABLE]
Then, the test rejects the null hypothesis if, for some ,
[TABLE]
In comparison to the test in [17], the collection of tuning parameters is slightly narrower and the threshold has an additional corrective term of the order of .
Proposition 5.
There exist positive constants such that the following holds under Condition (). The test satisfies (P1) and (P2[]) on the collection of parameters satisfying , $d_2^2\left[\theta^*; \mathbb{B}_0[k_0]\right] \leq \sigma^2$ and
[TABLE]
In comparison to Condition (14) for Proposition 4, is possibly much smaller than for in the regime where .
2.2.5 Aggregated test
To conclude this section, we evaluate the performances of the combination of all the previous tests. In fact, is only defined in the large regime. We take the convention that is a trivial test that always accepts the null hypothesis in the small regime. Consider the aggregated test
[TABLE]
The last test is introduced for technical purposes to handle very dense alternatives ().
Theorem 2.
Let and . There exist positive constants and such that the following holds. Assume that and that Condition () is satisfied. Define
[TABLE]
The test satisfies (P1) and (P2[]) on the collection of parameters
[TABLE]
with .
The case is a simple corollary of the previous results, whereas the dense case requires further work.
To further compare this result with the minimax lower bound of Theorem 1, we assume that for some . Recall that we also suppose . From Theorems 1 and 2, we deduce that
Case 1: with an arbitrary .
[TABLE]
Case 2: with an arbitrary . For any arbitrarily small, we have
[TABLE]
and that all these bounds are simultaneously achieved by the test . As a consequence, is simultaneous minimax over all and all except in the regimes when is close to or when is close to , in which case, there is possibly a polylogarithmic difference between the minimax lower and upper bounds.
Proof.
If and , then . Hence (6) in Theorem 1 ensures that the square minimax separation distance is at least of the order of . The first term is (up to numerical constants) larger than second one when . For , we have . Hence, the square minimax separation distance is at least of the order of which matches the upper bound of Theorem 2. If , , and , then and Theorem 6 ensures that the square minimax separation distance is at least of the order of , matching again Theorem 2. When , , and we deduce from Theorems 1 and 2 that the square minimax separation distance is of order of . ∎
Let us summarize the different regimes
- If is small - first result in Cases 1 and 2 - then the squared minimax separation distance () is the same as for signal detection (). The upper bound can be achieved using any -consistent estimator of and simply counting the number of its large entries. In the independent setting, such an estimator is easily built using the raw correlation () between the variables and the response. Alternatively, one could use the debiased Lasso [31, 51, 44, 32], which is valid for a wider class of .
- If is large and is small - second result in Case 1 - then the squared minimax separation distance can be understood as the sum of the quantity arising in signal detection and the complexity of the null hypothesis. The matching upper bound is achieved by computing the norm of the residuals when plugging a suitable estimator of . The upper bound was already obtained in [39] (for a computationally inefficient method) but the matching minimax lower bound is new.
- Finally, if is large and is large - second result in Case 2 - then the minimax separation distance is highly non standard and depends on the complexity of the null hypothesis. Both the lower and upper bounds are new. In some way, they both draw inspiration from the analysis [17] of the same problem in the Gaussian sequence framework.
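The first regime above (estimate the parameter entrywise, then count its large entries) can be sketched numerically. The following is a minimal illustration in the independent setting, assuming standard normal covariates; the function names and the threshold constant `3.0` are hypothetical choices made for this sketch and are not the calibrated quantities of the paper.

```python
import numpy as np

def raw_correlation_estimate(X, y):
    """Entrywise estimate X^T y / n; unbiased for theta when rows of X are N(0, I)."""
    return X.T @ y / X.shape[0]

def count_large_entries(theta_hat, threshold):
    """Number of coordinates whose absolute value is at least `threshold`."""
    return int(np.sum(np.abs(theta_hat) >= threshold))

rng = np.random.default_rng(0)
n, p, k0, sigma = 2000, 200, 3, 1.0
theta = np.zeros(p)
theta[:5] = 2.0                      # 5 non-zero entries: theta is not 3-sparse
X = rng.standard_normal((n, p))
y = X @ theta + sigma * rng.standard_normal(n)

theta_hat = raw_correlation_estimate(X, y)
# Hypothetical threshold: a multiple of the entrywise noise scale, where
# mean(y^2) estimates sigma^2 + ||theta||^2 under the identity covariance.
threshold = 3.0 * np.sqrt(np.mean(y**2) * np.log(p) / n)
n_large = count_large_entries(theta_hat, threshold)
reject = n_large > k0                # reject "theta is k0-sparse" if too many large entries
```

In this toy configuration the signal entries are far above the threshold, so the count recovers the five non-zero coordinates and the sketch rejects the 3-sparse null.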
In this paper, we focused on recovering the minimax separation distance in the high-dimensional setting, namely we require , where is an arbitrarily small universal constant. Aside from this restriction, there are two gaps in our analysis:
- First, when and , our minimax lower bounds in Theorem 1 imply that the testing problem is almost impossible. However, for , we did not manage to prove similar lower bounds. We conjecture that, for and , is huge.
- Some polylogarithmic mismatches between the upper and lower bounds arise when is close to - e.g. for some and when gets close to from below - i.e. for some arbitrarily small universal constant . In that regime, we could improve our upper bounds by adapting some higher-criticism [23] procedures as it was done in the sequence model [17]. However, even this new procedure would not completely close the gap. We conjecture that our minimax bound (6) is not completely sharp in that regime (see its proof for a tentative explanation).
3 General Setting
In this section, we focus on the general setting where is unknown and is only assumed to belong to some class (4) for some . The noise variance is also assumed to be unknown.
3.1 Minimax lower bound
Obviously, is at least as large as since the covariance matrix is unknown and belongs to . Therefore, Theorem 1 in the previous section provides a lower bound on . It turns out that this lower bound is sometimes loose and that the general setting is actually more challenging in some regimes, as shown by the following proposition.
Proposition 6.
Assume that . There exist positive numerical constants and such that for and for all , one has
[TABLE]
with .
In fact, this result is a combination of Theorem 1 together with known minimax lower bounds for the detection problem () with unknown variance [45, 46].
In comparison to the independent setting, one can no longer achieve the rate . Most importantly, the testing problem becomes almost impossible for dense alternatives () in the high-dimensional regime .
3.2 Testing procedures
We can no longer rely on the test , as the noise level is unknown, nor on and , as their construction relies on the independence of the covariates.
As in the previous sections, we introduce two properties (gP1) and (gP2) characterizing the type I and II error probabilities in this setting where the noise level and the covariance matrix are unknown.
Property gP1. A test satisfies (gP1[]) if its type I error probability is less than or equal to , that is .
Property gP2. A test satisfies (gP2[]) on the collection of parameters if its type II error probability is uniformly less than or equal to , that is .
Note that in the above bound is rescaled by for homogeneity purposes. As in Section 2, we restrict our attention to sparsities that are less than . The numerical constants and in the following condition are introduced in the proof of Proposition 7 and Corollary 1.
([)
$(k_{0}\vee 1)\big[1+\log(p/\alpha)\big]+\log^{3}\big(\frac{1}{\alpha}\big)+\log(p)\log\big(\frac{1}{\alpha}\big)\leq\underline{c}_{\eta}^{(\mathbf{B})}n$ and
In this section, we divide the sample in two subsamples and of equal size . As previously, we shall combine several tests to match the minimax lower bounds.
3.2.1 Test based on a -statistic.
The first test is specific to the moderate regime . For known , we introduced in the previous section a statistic relying on the observation that estimates well . Then, relying on a good -sparse estimator of and computing the square norm of the residuals, we estimate , which under the null, should be small. Here, we follow the same strategy by considering an estimator of the signal strength, still valid for unknown .
In [22], Dicker tackled the problem of estimating the signal strength in the setting where and is unknown. This led him to introduce the -statistic , which is unbiased and consistent. For general , this statistic has later been shown to be concentrated around the quadratic form (see [47, Sect.2.1]). As a consequence, one can rely on it to test the nullity of .
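To make the idea of a signal-strength statistic concrete, here is a minimal sketch of a generic order-2 U-statistic, $\frac{1}{n(n-1)}\sum_{i\neq j} Y_i Y_j \langle X_i, X_j\rangle$. For centered covariates with covariance $\boldsymbol{\Sigma}$ and independent centered noise, its expectation is $\|\boldsymbol{\Sigma}\theta^{*}\|_2^2$, which reduces to $\|\theta^{*}\|_2^2$ under the identity covariance. This is an illustrative form only: the exact normalization used in [22] or in this paper may differ, and the function names are hypothetical.

```python
import numpy as np

def u_statistic(X, y):
    """(1 / (n (n - 1))) * sum_{i != j} y_i y_j <x_i, x_j>, computed in O(n p)."""
    n = X.shape[0]
    full = np.sum((X.T @ y) ** 2)                 # sum over all pairs (i, j)
    diag = np.sum(y**2 * np.sum(X**2, axis=1))    # pairs with i == j
    return (full - diag) / (n * (n - 1))

def u_statistic_naive(X, y):
    """Reference O(n^2 p) double loop, only meant for checking small inputs."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += y[i] * y[j] * (X[i] @ X[j])
    return total / (n * (n - 1))

rng = np.random.default_rng(1)
n, p = 4000, 50
theta = np.zeros(p)
theta[:5] = 1.0                                   # ||theta||^2 = 5
X = rng.standard_normal((n, p))                   # identity covariance (assumption)
y = X @ theta + rng.standard_normal(n)            # noise level unknown to the statistic
signal_hat = u_statistic(X, y)
```

The fast formula removes the diagonal terms from $\|X^{T}Y\|_2^2$, which avoids the quadratic cost in the sample size; note that the noise level never enters the computation.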
For composite null hypotheses, we use to build as in Subsection 2.2.2 and then compute the residuals with respect to the second sample, . Finally, we define the normalized -statistic by
[TABLE]
Conditionally on , is the response of a linear regression model with parameter , variance , and random design . Hence, the second moment of each entry of equals and is therefore close to . Intuitively, the statistic is therefore expected to be close to
[TABLE]
so that a large value for suggests that is significantly different from a sparse vector. Setting the threshold
[TABLE]
we consider the test rejecting the null hypothesis when .
Proposition 7.
There exist three constants , and such that the following holds under Condition (B()) and if $2n\leq p\leq c_{\eta}n^{2}\log^{-1}\big(\frac{2}{\alpha\wedge\beta}\big)$. The test satisfies (gP1[]) and (gP2[]) on the collection
[TABLE]
3.2.2 Recovering the rate with variable selection
To achieve the rate, it would suffice to estimate at the rate as we did for the test in the previous section. However, we are unaware of any estimator achieving this rate uniformly over the class of covariance matrices . For , it is even proved that no such estimator exists [9].
Here, we adopt another strategy. We shall first estimate the support of and count the number of large entries of the least-squares estimator of restricted to the estimated support . Of course, if with high probability, then the restricted least-squares estimator (see below for a definition) will be close to in norm. Unfortunately, it is impossible for an estimator to estimate exactly the support , especially when contains arbitrarily small coordinates.
This is why we shall require that the estimator satisfies a weaker property. Given , let be the number of small but non zero coefficients of . Below , , refer to three positive quantities. Recall that is the complement of .
Property (). A (possibly random) set is said to satisfy this property if
[TABLE]
In other words, the cardinality of is not too large compared to the sparsity of and the square norm of outside is at most as large as that of the small entries of . Observe that the large entries of are not required to belong to .
Then, given a set , we consider the restricted least-squares estimator and the plug-in variance estimators
[TABLE]
For a vector and , is the number of entries of larger than or equal to (in absolute value) . Then, we define the test rejecting the null if and only if , which means that contains at least large entries.
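The construction just described (refit by least squares on a candidate support, estimate the noise level from the residuals, count the large refitted entries) can be sketched as follows. The plug-in variance used here (residual mean square) and the threshold constant `c` are assumptions made for illustration; they stand in for the paper's estimators and tuning, which are not reproduced in this sketch.

```python
import numpy as np

def restricted_least_squares(X, y, S):
    """Least-squares fit using only the columns indexed by S; zero elsewhere."""
    theta = np.zeros(X.shape[1])
    idx = np.asarray(sorted(S), dtype=int)
    if idx.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        theta[idx] = coef
    return theta

def n_large(theta, t):
    """Number of entries of theta that are at least t in absolute value."""
    return int(np.sum(np.abs(theta) >= t))

def threshold_test(X, y, S, k0, c=3.0):
    """Return (decision, refit); decision is 1 when more than k0 entries are large."""
    n, p = X.shape
    theta_S = restricted_least_squares(X, y, S)
    resid = y - X @ theta_S
    sigma2_hat = resid @ resid / max(n - len(S), 1)    # assumed plug-in variance
    t = c * np.sqrt(sigma2_hat * np.log(max(p, 2)) / n)  # hypothetical threshold
    return int(n_large(theta_S, t) > k0), theta_S

rng = np.random.default_rng(4)
n, p = 500, 100
theta = np.zeros(p)
theta[:4] = 1.0                                        # 4-sparse parameter
X = rng.standard_normal((n, p))
y = X @ theta + rng.standard_normal(n)
S = list(range(10))                                    # contains the true support
reject_k2, _ = threshold_test(X, y, S, k0=2)
reject_k4, _ = threshold_test(X, y, S, k0=4)
```

With a support estimate that contains the relevant variables, the refitted estimator behaves like an ordinary least-squares fit in low dimension, so the count separates the 2-sparse null (rejected) from the 4-sparse null (accepted) in this toy example.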
Theorem 3.
There exist constants and such that the following holds for any . Consider any , , , , satisfying $c\big[\mathbf{a}_{2}\|\theta^{*}\|_{0}+\log\big(\frac{4}{\delta}\big)\big]\leq m$, and satisfying . Taking
[TABLE]
we have $\mathbb{P}^{(0)}_{\theta^{*},\sigma,\boldsymbol{\Sigma}}\big[\phi^{(th)}[S,\underline{c}_{*}]=1\big]\leq\delta$ if . Besides, $\mathbb{P}^{(0)}_{\theta^{*},\sigma,\boldsymbol{\Sigma}}\big[\phi^{(th)}[S,\underline{c}_{*}]=1\big]\geq 1-\delta$ if and
[TABLE]
If satisfies (), then the test with a suitable tuning parameter has a controlled type I error probability. Besides, its square separation distance over is (up to constants depending on and ) of the order of .
In view of this general result, it suffices to build an estimator of the support based on that satisfies () for small to get the squared separation distance .
Unfortunately, the support of the Lasso estimator is only proved to satisfy the first part of property (). Its number of false positives is at most of the order of , see [50]. It turns out that the second part of the property has only recently been proved to be achieved by non-convex penalized estimators such as the MCP estimator [49], see [24].
As in the previous section, we consider the column normalized design . Given and two tuning parameters , , the MCP criterion is defined by
[TABLE]
where . Local minimizers of the MCP criterion can be efficiently computed using the PLUS Algorithm from [49] or by the approximate regularization path method of [48]. It turns out that non-convex penalized estimators suffer from less bias than Lasso estimators.
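The reduced bias of non-convex penalties can be seen directly on the penalty function. The sketch below implements the minimax concave penalty in its usual parametrization, $\rho(t;\lambda,\gamma)=\lambda|t|-t^{2}/(2\gamma)$ for $|t|\leq\gamma\lambda$ and $\gamma\lambda^{2}/2$ otherwise; it is assumed, not verified here, that this matches the parametrization of the criterion above, and the function names are hypothetical.

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """Minimax concave penalty, applied entrywise (usual parametrization)."""
    a = np.abs(np.asarray(t, dtype=float))
    inner = lam * a - a**2 / (2.0 * gamma)   # region |t| <= gamma * lam
    outer = 0.5 * gamma * lam**2             # constant beyond gamma * lam
    return np.where(a <= gamma * lam, inner, outer)

def mcp_derivative(t, lam, gamma):
    """Derivative of the penalty in |t|: it decays linearly from lam to 0."""
    a = np.abs(np.asarray(t, dtype=float))
    return np.maximum(lam - a / gamma, 0.0)

lam, gamma = 1.0, 3.0
grid = np.linspace(0.0, 5.0, 11)
penalty_values = mcp_penalty(grid, lam, gamma)
shrinkage = mcp_derivative(grid, lam, gamma)   # equals lam at 0, and 0 for |t| >= gamma * lam
```

The Lasso penalty has constant derivative $\lambda$, so it shrinks every coefficient by the same amount; the derivative above vanishes for large coefficients, which is the source of the smaller bias.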
Consider the square-root Lasso estimator (8) with and the plug-in variance estimator . Define the tuning parameters
[TABLE]
for some constants and whose range of possible values follows from [49] and [24]. The following proposition is a consequence of Corollary 1 in [24] together with Theorem 6 in [49].
Proposition 8.
There exist constants , , – such that the following holds for any with . With probability higher than , the support of any stationary point of the criterion (28) satisfies .
A similar result holds if we use the non-convex SCAD penalty instead of MCP from [24].
Now, we can plug the support estimator into the test with a suitable constant . The following result is a straightforward consequence of Theorem 3 and Proposition 8 and its proof is therefore omitted.
Corollary 1.
There exist constants , , and such that the following holds under Condition (B()). The test satisfies (gP1[]) and (gP2[]) over the collections
[TABLE]
for all .
3.2.3 Aggregated tests and summary
Consider some . Since the performances of the test are only assessed in the regime $p\leq c_{\eta}n^{2}\log^{-1}\big(\frac{2}{\delta}\big)$ ( is introduced in Proposition 7), we combine the tests and only in that regime. For larger , we solely use . Combining Proposition 7 and Corollary 1 to evaluate the separation distance of the aggregated test and comparing them with the minimax lower bounds of Proposition 6, we obtain the following characterization. Note that we assume here that we are in the high-dimensional regime, i.e. where is an arbitrarily small absolute constant.
Case 1: with an arbitrary but fixed and .
[TABLE]
for .
Case 2: with an arbitrary but fixed and .
[TABLE]
for .
Case 3: . For any and smaller than , we have
[TABLE]
whereas the problem becomes much more difficult for larger or .
In conclusion, the aggregated test achieves the minimax separation distance except in the regime where where there is a gap between the two squared rates.
Summing up our findings, we observe that
- For sparse alternatives (small ) - first result in Cases 1 and 2 and result in Case 3 - then the minimax separation distance is analogous to that of signal detection (), i.e. of order . It would be straightforward to achieve this distance if we had at our disposal a -consistent estimator of . However, this is not possible over the class of ( unknown in this class) [9, 32]. This is why we use a slightly different approach that focuses on selecting most of the relevant features (as in Property (25)).
- If is large and is small - second result in Case 1 - then the squared minimax separation distance is of the order of and is the same as for signal detection (). It is achieved by a -statistic originally introduced for estimating when [22, 47].
- If is large and is large - second result in Case 2 - then the lower bound on the minimax separation distance reflects the complexity of the null hypothesis. The lower bound is the same as in the independent setting, described in the previous section. The upper bound is based on the same -statistic as in the previous case. Unfortunately the upper and lower bounds only match up to a factor. In this general setting, we doubt that adapting the Fourier statistic of the previous section is possible and we conjecture that the squared separation distance is actually of the order .
- Finally, we emphasize that, for large compared to and , the optimal separation distance is huge (Proposition 6). Without further assumptions, it is therefore almost impossible to test whether is -sparse or if is a dense vector when . This result is in sharp contrast with the independent setting.
3.2.4 An alternative variable selection procedure
In the previous section, we established that the test applied to the support estimated by the MCP estimator achieves the square separation rate . Here, we introduce an alternative to the concave penalized estimator MCP based on simple iterations of the thresholded square-root Lasso.
Starting from , the algorithm builds a subset of variables iteratively from a subset of variables. It is done by applying a thresholded square-root Lasso to the data projected on the orthogonal of the variables in . Then, is the concatenation of and the variables selected by the thresholded square-root Lasso. The procedure stops after approximately iterations, and returns the current subset. The general idea is to iteratively remove non-zero coordinates of so that the projected square-root Lasso estimator is less perturbed by large coordinates of .
We need to introduce some notation. Define . Assume without loss of generality that is an integer. We divide the sample into subsamples of size . Given a matrix and some subset , we write for the matrix defined by . Given and any , define the subspace of and an matrix (measurable with respect to ) whose corresponding linear application is null on and maps isometrically the orthogonal of to .
Next, we define the Thresholded square-root Lasso estimator. Let and let . Given a matrix and a size vector , we write for the subdesign matrix of where its null rows have been removed. Then, stands for the square-root Lasso estimator (see Equation (8)) of with parameter . For the purpose of notation, we consider that and that its entries corresponding to null rows of are equal to zero. Using the plug-in variance estimator , we define the thresholding modification of such that
[TABLE]
where the constant is introduced in Lemma 1.
The set is constructed as follows. We start with the empty support . At each step , we project both and along the space spanned by the variables in . Then, we apply thresholded square-root Lasso to these projected data to select new variables. Finally, is the last set .
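The project-then-select loop can be sketched in a few lines. To keep the example short and self-contained, the selection step below is a stand-in (thresholding of normalized correlations with the projected response) and not the thresholded square-root Lasso of the paper; the sketch also reuses the same data at every step, whereas the procedure above works with separate subsamples. All names and the threshold value are hypothetical.

```python
import numpy as np

def project_out(M, X_S):
    """Residual of M after orthogonal projection onto the column span of X_S."""
    if X_S.shape[1] == 0:
        return M.copy()
    Q, _ = np.linalg.qr(X_S)
    return M - Q @ (Q.T @ M)

def select_by_correlation(X_res, y_res, threshold):
    """Stand-in selector: columns whose correlation with y_res is large."""
    norms = np.linalg.norm(X_res, axis=0)
    norms[norms < 1e-10] = np.inf          # columns already projected out score 0
    y_norm = max(np.linalg.norm(y_res), 1e-12)
    scores = np.abs(X_res.T @ y_res) / (norms * y_norm)
    return np.flatnonzero(scores >= threshold)

def iterative_selection(X, y, n_steps, threshold):
    """Grow a support by repeatedly projecting out the selected variables."""
    S = []
    for _ in range(n_steps):
        X_S = X[:, S]
        y_res = project_out(y, X_S)
        X_res = project_out(X, X_S)
        new = [int(j) for j in select_by_correlation(X_res, y_res, threshold)
               if int(j) not in S]
        if not new:
            break
        S.extend(new)
    return sorted(S)

rng = np.random.default_rng(3)
n, p = 400, 100
theta = np.zeros(p)
theta[:2] = 5.0                            # two large coordinates
theta[2:4] = 0.8                           # two moderate coordinates
X = rng.standard_normal((n, p))
y = X @ theta + rng.standard_normal(n)
S_hat = iterative_selection(X, y, n_steps=5, threshold=0.25)
```

In this toy example the moderate coordinates are masked by the large ones in the first pass and only become detectable once the large ones have been projected out, which is the motivation given above for iterating.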
Theorem 4.
There exist constants and such that the following holds for any , and any satisfying
[TABLE]
With probability higher than , the estimator satisfies .
It turns out that satisfies the desired property but with and that are logarithmically large. As a consequence, the squared separation distance of the corresponding test with is of order , which is optimal up to an additional term in the regime .
4 Discussion
In this section, we briefly discuss several related problems.
4.1 Low-dimensional problems
Although some of our results are valid in a low-dimensional setting, we focused our attention on pinpointing the minimax separation distance in a high-dimensional regime which is arguably the most interesting one. Let us briefly discuss the low-dimensional regime . In the independent setting, the main difference is that the rate can be improved to by considering the ordinary least-squares estimator and computing its -norm when its largest entries are removed. In the general setting, we can recover similar upper bounds as in Section 3, but with much simpler procedures based on the ordinary least-squares estimator.
Between these two regimes, the medium-dimensional case where and are of the same order is technically challenging. Our upper bounds and lower bounds only match up to polylogarithmic factors. Deriving the sharp minimax separation distance requires further work.
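The quantity computed in the low-dimensional procedure, the norm of a vector once its largest entries are removed, is the distance to the set of sparse vectors. A minimal sketch, with hypothetical function names and a plain least-squares fit via `numpy.linalg.lstsq`:

```python
import numpy as np

def sq_dist_to_sparse(theta, k0):
    """Squared l2 distance from theta to the set of k0-sparse vectors.

    The closest k0-sparse vector keeps the k0 largest entries (in absolute
    value), so the distance is the energy of the remaining entries.
    """
    a = np.sort(np.abs(np.asarray(theta, dtype=float)))   # ascending order
    return float(np.sum(a[: max(len(a) - k0, 0)] ** 2))

def ols_plugin_distance(X, y, k0):
    """Plug the ordinary least-squares estimator into the distance (n > p)."""
    theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    return sq_dist_to_sparse(theta_ols, k0)

example = sq_dist_to_sparse([3.0, -1.0, 0.5, 2.0], k0=2)   # drops 3 and 2, keeps 1 and 0.5
```

The plug-in value is inflated by the estimation noise of the discarded entries, so an actual test would compare it to a threshold that accounts for this; that calibration is not attempted here.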
4.2 Sparse inverse covariance matrices and debiased Lasso
Consider an intermediary setting where both and are unknown but is also restricted to have less than non-zero entries in each row. In this setting, the minimax lower bounds of Proposition 6 in the general setting turn out to be still valid. Indeed, the proof of Proposition 6 holds in the simpler setting where and is unknown. As in the general setting, the upper bound is achieved by the polynomial time -statistic of Section 3.2.1. In contrast, achieving the separation distance in the small regime is now much easier than in the general setting. Whereas we introduced a refitted least-squares estimator combined with the non-convex MCP regularized estimator, one can now alternatively rely on the debiased Lasso method [31, 51, 44, 32] to obtain a -consistent estimator of and then simply count the number of its large entries. This was already done in [30] as discussed previously.
4.3 Known $\boldsymbol{\Sigma}$ and unknown $\sigma$.
Consider the intermediate scenario where is the identity matrix, but is unknown. As explained in the previous subsection, the minimax lower bounds of Proposition 6 stated for the general setting are still valid in this intermediate scenario. Obviously, we can also apply the testing procedures of Section 3.
However, in the general scenario, our lower and upper bounds only match up to a factor in the large , large setting. More specifically, when (for some ) and , the lower bound of Proposition 6 is of order whereas Proposition 7 provides an upper bound of the order of .
It turns out that, in this intermediate scenario, the gap is easily closed by adapting the Fourier-based tests and introduced in Section 2. Indeed, the only place where the knowledge of is necessary in these two tests is in the definition of the pre-estimator which is a thresholded version of (9). If we replace in this threshold by the plug-in estimator of the variance based on the square-root Lasso and if we increase some constants, this modification of the tests and no longer depends on . Besides, one can easily check that (up to some changes in the numerical constants) Propositions 4 and 5 are still valid for these tests.
4.4 Unknown $\boldsymbol{\Sigma}$ and known $\sigma$.
In this case, we can improve the upper bounds of the general case by adapting the test from Section 2. Indeed, the statistic is now centered on on the class of . Hence, the corresponding test achieves a squared separation distance of the order of . The main difference with the independent case is that we are not able to adapt the Fourier-based tests and to unknown . In regimes where and , there is therefore a gap between our upper and lower bounds.
5 Proofs of the minimax upper bounds
5.1 Some results on the square-root Lasso and a simple debiased Lasso
We start with a few probability bounds for the square-root Lasso and its thresholded modification where only the largest values of are not set to $0$. They follow almost straightforwardly from earlier results [4, 36, 42, 40]. As we shall apply this lemma in different contexts, we reintroduce the setting here. We consider a linear regression model with and where the rows of are independent and follow a centered normal distribution with common covariance matrix . The matrix is the column-normalized version of , i.e.
[TABLE]
We take
[TABLE]
and consider the square-root Lasso estimator [4, 42] ,
[TABLE]
Then, define as for any and . The plug-in variance estimator is .
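For readers who want to experiment, the square-root Lasso can be computed by alternating between a Lasso step and a noise-level update (the scaled-Lasso formulation, which shares its minimizers in the parameter with the square-root Lasso). The sketch below assumes the objective $\|Y-\mathbf{W}\theta\|_2/\sqrt{n}+\lambda\|\theta\|_1$ with columns of $\mathbf{W}$ rescaled to squared norm $n$, and a hypothetical tuning $\lambda=1.1\sqrt{2\log(p)/n}$; the normalization and tuning of the paper may differ. The helper `keep_largest` mimics the thresholded modification that keeps only the largest entries.

```python
import numpy as np

def soft_threshold(z, mu):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - mu, 0.0)

def lasso_cd(W, y, mu, theta0, n_sweeps=50):
    """Coordinate descent for (1/(2n)) ||y - W theta||^2 + mu ||theta||_1."""
    n, p = W.shape
    theta = theta0.copy()
    col_sq = np.sum(W**2, axis=0) / n
    r = y - W @ theta                                 # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            rho = W[:, j] @ r / n + col_sq[j] * theta[j]
            new = soft_threshold(rho, mu) / col_sq[j]
            r -= W[:, j] * (new - theta[j])
            theta[j] = new
    return theta

def sqrt_lasso(X, y, lam, n_outer=20):
    """Alternate Lasso steps (penalty lam * sigma) and noise-level updates."""
    n, p = X.shape
    scale = np.sqrt(n) / np.linalg.norm(X, axis=0)    # assumed column normalization
    W = X * scale
    theta_w = np.zeros(p)
    sigma = np.linalg.norm(y) / np.sqrt(n)
    for _ in range(n_outer):
        theta_w = lasso_cd(W, y, lam * sigma, theta_w)
        sigma = np.linalg.norm(y - W @ theta_w) / np.sqrt(n)
    return theta_w * scale, sigma                     # coefficients on the scale of X

def keep_largest(theta, k):
    """Set to zero all but the k largest entries (in absolute value)."""
    out = np.zeros_like(theta)
    if k > 0:
        idx = np.argsort(np.abs(theta))[-k:]
        out[idx] = theta[idx]
    return out

rng = np.random.default_rng(2)
n, p = 200, 50
theta = np.zeros(p)
theta[:3] = 2.0
X = rng.standard_normal((n, p))
y = X @ theta + rng.standard_normal(n)
lam = 1.1 * np.sqrt(2.0 * np.log(p) / n)              # hypothetical tuning
theta_sl, sigma_hat = sqrt_lasso(X, y, lam)
theta_thr = keep_largest(theta_sl, 3)
```

The tuning parameter does not involve the noise level, which is the practical appeal of the square-root Lasso; the residual norm `sigma_hat` plays the role of a plug-in noise estimate.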
Lemma 1.
Fix any and any . There exist constants and such that the following holds. Let be the largest integer such that
[TABLE]
For any , and with , there exists an event of probability higher than , such that
[TABLE]
Proof of Lemma 1.
We first argue that the design matrix satisfies the compatibility property (see [36, 42]) with any set of size less than and constant depending on . Indeed, Corollary 1 in [40] ensures that this property holds with probability higher than . Then, we are in a position to apply Theorem 1 in [42], which implies that belongs to and that
[TABLE]
The second result of the lemma is a consequence of the first result. Denote (resp. ) the subset of the largest entries of (resp. ). From the definition of , we deduce that
[TABLE]
The result follows. ∎
5.2 Analysis of the tests , , and
5.2.1 Proof of Proposition 2 (Test )
We start with an error bound on .
Lemma 2.
There exists a constant such that the following holds under (). For any with (with as in Lemma 1), we have
[TABLE]
with probability higher than . Besides, for any , we have
[TABLE]
with probability higher than .
From Lemma 2, we derive that with probability higher than , we have . Setting as in Lemma 2, we derive that the test has a type I error probability less than or equal to . Now consider a vector such that . From Lemma 2, we deduce that, with probability higher than ,
[TABLE]
and the test therefore rejects the null hypothesis, which concludes the proof.
Proof of Lemma 2.
Set . If , the conditions of Lemma 1 are satisfied and it follows from this lemma that
[TABLE]
with probability higher than . In the second result of Lemma 2, we restrict ourselves to the case . Hence, it suffices to prove that, conditionally on satisfying , we have with probability higher than .
Write the statistic defined by . Also, define and its diagonal part. We have
[TABLE]
We control each of these three quantities independently.
Lemma 3.
Let be a symmetric matrix and let . Define . For any , one has
[TABLE]
with probability higher than . Here, and respectively correspond to the Frobenius and operator norm of .
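A quick Monte Carlo experiment illustrates this type of deviation bound. The sketch assumes the inequality has the familiar form $\varepsilon^{T}\mathbf{A}\varepsilon-\mathrm{tr}(\mathbf{A})\leq 2\|\mathbf{A}\|_{F}\sqrt{x}+2\|\mathbf{A}\|_{op}x$ with probability at least $1-e^{-x}$ for a standard Gaussian vector $\varepsilon$; this form is an assumption made for illustration and the constants of Lemma 3 may differ.

```python
import numpy as np

def deviation_bound(A, x):
    """Assumed upper deviation 2 ||A||_F sqrt(x) + 2 ||A||_op x."""
    fro = np.linalg.norm(A, "fro")
    op = np.linalg.norm(A, 2)                      # largest singular value
    return 2.0 * fro * np.sqrt(x) + 2.0 * op * x

rng = np.random.default_rng(0)
d, x, n_mc = 10, 2.0, 20000
B = rng.standard_normal((d, d))
A = (B + B.T) / 2.0                                # symmetric, not necessarily positive
eps = rng.standard_normal((n_mc, d))
quad = np.einsum("ij,jk,ik->i", eps, A, eps)       # eps_i^T A eps_i for each draw
exceed_freq = float(np.mean(quad - np.trace(A) > deviation_bound(A, x)))
nominal = float(np.exp(-x))                        # level the bound is meant to guarantee
```

Comparing `exceed_freq` with `nominal` shows how conservative the bound is for a given matrix; the matrix here has eigenvalues of both signs, in line with the extension to general symmetric matrices.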
This result is a slight extension of Lemma 1 in [37] (that requires to be positive). The extension to general symmetric matrices proceeds from the same arguments and we omit the proof.
Let us first control . Each of the entries of is distributed as a quadratic form of standard normal random variables. The corresponding matrix satisfies and . Since , it follows from the above lemma together with a union bound that
[TABLE]
with probability higher than . As for and , we first work conditionally on . Fix , is distributed as a quadratic form of standard normal variables and the corresponding matrix satisfies , and . It then follows from Lemma 3 that, conditionally on ,
[TABLE]
with probability higher than . As for , observe that, conditionally on , is distributed as where the ’s and ’s are independent standard normal random variables. Again, we deduce from Lemma 3 that, conditionally on ,
[TABLE]
with probability higher than . Finally, we gather (33) with (34–36) to conclude that there exists such that
[TABLE]
with probability larger than .
∎
5.2.2 Proof of Proposition 3 (Test )
We first state the following lemma that characterizes the deviations of .
Lemma 4.
For any , any , any , and any fixed , we have for, ,
[TABLE]
with probability higher than .
Proof of Lemma 4.
We have
[TABLE]
Hence, and these variables are independent from each other. So the random variable follows a distribution with degrees of freedom. To prove the result, we only have to apply Lemma 3 with . ∎
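The chi-square argument can be checked by simulation. Under the identity covariance and for a fixed vector $\widetilde{\theta}$ (independent of the sample), each residual $Y_i-\langle X_i,\widetilde{\theta}\rangle$ is centered normal with variance $\sigma^{2}+\|\theta^{*}-\widetilde{\theta}\|_2^{2}$, so the rescaled residual norm recovers the distance between the two vectors. The sketch below uses hypothetical names and a known noise level.

```python
import numpy as np

def residual_signal_estimate(X, y, theta_tilde, sigma):
    """||y - X theta_tilde||^2 / n - sigma^2, an estimate of ||theta* - theta_tilde||^2."""
    n = X.shape[0]
    r = y - X @ theta_tilde
    return float(r @ r / n - sigma**2)

rng = np.random.default_rng(5)
n, p, sigma = 5000, 100, 1.0
theta_star = np.zeros(p)
theta_star[:4] = 1.0
theta_tilde = np.zeros(p)
theta_tilde[:2] = 1.0                          # fixed 2-sparse candidate; distance^2 = 2
X = rng.standard_normal((n, p))                # identity covariance (assumption)
y = X @ theta_star + sigma * rng.standard_normal(n)
dist_hat = residual_signal_estimate(X, y, theta_tilde, sigma)
```

Since the residual norm is a rescaled chi-square variable with $n$ degrees of freedom, the fluctuations of `dist_hat` are of order $(\sigma^{2}+\|\theta^{*}-\widetilde{\theta}\|_2^{2})/\sqrt{n}$. In practice $\widetilde{\theta}$ must be computed on an independent subsample for this argument to apply.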
First assume that belongs to . With probability higher than , we have
[TABLE]
where we used Condition (]). Then, we apply Lemma 1 to control with probability higher than . With probability higher than , we get
[TABLE]
so that choosing the constant large enough leads to .
Now assume that . Since is -sparse, it follows that $\|\theta^{*}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}\geq d^{2}_{2}\big[\theta^{*};\mathbb{B}_{0}[k_{0}]\big]$. Then, Lemma 4 implies that, for small enough compared to (which is ensured by Condition ), one has
[TABLE]
with probability larger than . As a consequence, under condition (11) with a constant large enough, the type II error probability is less than .
5.2.3 Proof of Proposition 7 (test )
The following lemma is borrowed from Theorem 2.1 in [47].
Lemma 5.
There exist numerical constants and such that the following holds. Assume that . Consider any , any , and any . Given any estimator based on the subsample , define . We have, conditionally on ,
[TABLE]
for all .
First, assume that belongs to . Since Condition (B[]) is satisfied with a constant large enough, we can apply (38). With probability higher than , one has
[TABLE]
Then, we use Lemma 1 to conclude that
[TABLE]
with probability higher than . Setting the constant small enough, we conclude that the type I error probability of is less than .
Now assume that . Since is -sparse, it follows that $\|\theta^{*}-\widetilde{\theta}_{SL,k_{0}}\|_{2}^{2}\geq d^{2}_{2}\big[\theta^{*};\mathbb{B}_{0}[k_{0}]\big]$. Then, Lemma 5 implies that, for small enough compared to , with probability higher than , one has
[TABLE]
where we used in the second and fourth lines that belongs to . Now assume that is large enough so that Condition (24) is satisfied. Choosing the constant in (24) large enough and the constant small enough, it then follows that the type II error probability is smaller than .
5.3 Analysis of and (Propositions 4 and 5)
In the proofs of this subsection, we set
[TABLE]
To alleviate the notation, and since only depends on the first two subsamples, can be considered as fixed when we condition on these subsamples. To simplify the notation, we respectively write henceforth and for and and work conditionally on . For any and , we have and . Hence, we have
[TABLE]
and since the are independent, we have
[TABLE]
For , it holds that
[TABLE]
As a consequence, given and , behaves almost like a standard Gaussian vector. We shall prove that, under the condition of the propositions, the term in the covariance turns out to be negligible, whereas is closely related to . The following lemma states that the conditional expectations of and are almost the same as if the conditional covariance of was the identity matrix. Recall the function introduced in Section 2.2.3. Define the function by where . As explained in Section 2.2.4 and proved in [17] (Section C.2.3), . Obviously, we have . Besides it is also shown in [17] (Section C.2.3) that .
Lemma 6.
If , we have
[TABLE]
Consider any . If , we have
[TABLE]
Also, the next lemma ensures that the deviations of the statistics and are almost the same as if the conditional covariance of was the identity matrix.
Lemma 7.
Assume that . For any , one has
[TABLE]
Besides, for any and any , one has
[TABLE]
Analysis of the tests under the null hypothesis. The assumptions of Lemma 2 are fulfilled. As a consequence, we have with probability larger than . Henceforth, we call this event and work conditionally on it. Thus, the support of is included in that of which in turn implies that which implies . Besides, we also have .
Since for , it also follows from Assumption that
[TABLE]
Thus, we are in a position to apply Lemma 6. Since, as explained in Section 2.2.3, we have and , it follows from that lemma that $\operatorname{\mathbb{E}}^{(3)}_{\theta^{*},\sigma}\big[Z_{f}\,\big|\,(\|Y\|_{2},\theta)\big]\leq k_{0}+s^{2}/5$. Also since and we have
[TABLE]
Then, we apply the deviation inequalities of Lemma 7 and integrate them with respect to to conclude that
[TABLE]
[TABLE]
Taking the probability of the event into account, we conclude that the type I error probability of both tests is bounded by .
Analysis of the tests under the alternative hypothesis. Since is not too large, the assumptions of Lemma 2 are fulfilled. As under the null hypothesis, we have with probability higher than and we still work conditionally on this event called . If , then both tests reject the null hypothesis, so that we can assume henceforth that .
Since (44) is still valid, we are in a position to apply again Lemmas 6 and 7. Hence, conditionally on and , we have
[TABLE]
with probability higher than . Define by
[TABLE]
Recall that and , (see [17]). So it holds that
[TABLE]
Also, for any , we have
[TABLE]
with probability larger than . As above, we have , and so
[TABLE]
In the sequel, we show that (45) and (46) imply the desired type II error probability bounds.
Case 1: Analysis of (45) for . Write the tuning parameter used in [17] for the corresponding test in the Gaussian sequence model. Note that , , and . We have shown in the proof of Proposition 2 in [17] that for a vector and any
[TABLE]
as soon as one of the two following conditions holds for constants positive and large enough, depending only on
[TABLE]
It then follows from (45), that, given and satisfying , the test rejects the null with probability higher than if
[TABLE]
Recall that, for , . Since follows a distribution with degrees of freedom, we have with probability higher than (see Lemma 3). This implies that for any , we have
[TABLE]
Lemma 8.
Assume that the event holds, that , and that . We have .
As a consequence, on the intersection of and an event of probability higher than , we have
[TABLE]
Together with (47) and (48), we have characterized the type II error probability of .
Case 2: Analysis of (46) for . Observe is at most of the order of and is therefore negligible compared to . We have shown in the proof of Proposition 3 in [17] that, for a vector , and for any in we have
[TABLE]
for some , if for constants positive and large enough, depending only on
[TABLE]
Actually, in Proposition 3 in [17], we had considered a wider range of ’s as the collection was slightly larger, but this does not change the arguments here. In our setting, Condition (49) and (46) imply that for some if
[TABLE]
Then, arguing as in Case 1, we have on the intersection of and an event of probability higher than . Putting everything together, we have controlled the type II error probability of .
Proof of Lemma 6.
In view of the conditional distribution of given , one has
[TABLE]
Since , the remainder term is (in absolute value) less than
[TABLE]
Summing over all such that , we obtain the first result of Lemma 6. Turning to , we have
[TABLE]
As a consequence,
[TABLE]
where we used the condition in the second line. Summing this bound over all such that yields the desired result.
∎
Proof of Lemma 7.
We shall apply the Gaussian concentration theorem (see e.g. [5]) to both and . The covariance matrix associated to the conditional distribution of decomposes as with and in particular its operator norm is less than one. Write for a square-root of this matrix and let denote a standard Gaussian vector. Conditionally on , is distributed as . For any , define
[TABLE]
Given two vectors and , one has
[TABLE]
since the cosine function is -Lipschitz. As a consequence, the function is -Lipschitz. The deviation inequalities (42) then follow from the Gaussian concentration theorem (see e.g. [5]).
As for , we argue similarly that, for , it is conditionally distributed as a Lipschitz function of a standard Gaussian vector with Lipschitz constant
[TABLE]
Since , we have for any and the Lipschitz constant is therefore less than
[TABLE]
where the last inequality is a consequence of the definition of and and is detailed in the proof of Lemma 6 in [17].
∎
Proof of Lemma 8.
Under , we have . Hence,
[TABLE]
where we used in the second line the definition of and and we used together with Assumption in the last line.
∎
5.4 Proof of Theorem 2
Consider any . In view of Propositions 3–5, we can bound the rejection probability as follows
[TABLE]
Since, under the null hypothesis, is -sparse, we have
[TABLE]
Applying Lemma 1, we derive that, with probability higher than , . Thus, by Condition , we have . From (51), we derive that . Looking more closely at the proof of Propositions 3–5, we observe that each occurrence of the probability corresponds to the same control of the square-root Lasso estimator . As a consequence satisfies (). Turning to the Type II error, we fix and assume that .
Case 1: . If , then the squared separation distance in (20) is a consequence of Propositions 2 and 3 and is achieved by the combination of and . If , the squared separation distance is still achieved by . To prove the last part of the result, let us assume that is such that does not reject the null with high probability. We shall prove that this implies . From Proposition 3, we have . In view of Proposition 4, we have
[TABLE]
In view of Proposition 5, we have
[TABLE]
for all . Finally, Proposition 2 enforces that
[TABLE]
for all . Putting everything together, we obtain
[TABLE]
where we used that, for , is small compared to . This concludes the proof for Case 1.
Case 2: . In that case, is larger than and the first result in (20) is a consequence of the analysis of in Proposition 3. We now turn to the case and we need to prove that the squared separation distance is less than . If , then rejects the null hypothesis with high probability. Thus, we can assume that . Also, we can assume that , otherwise the test rejects the null. Finally, we can assume that , otherwise the test also rejects the null with high probability. By the triangle inequality, therefore satisfies and we are in position to apply Lemma 2, which implies
[TABLE]
with probability higher than conditionally on . As a consequence, the event involved in the proof of Propositions 4 and 5 is true. Since ensuring this event is the only place in the proof of these propositions where the restriction is needed, we conclude that, given , rejects the null with probability higher than if any of the conditions (14), (15), or (19) is satisfied. Similarly, Condition (52) (with replaced by ) allows us to adapt the proof of Proposition 2 without the restriction on . Thus, rejects the null with conditional probability higher than under (10).
Arguing as in Case 1, we conclude that the aggregated test rejects the null with high probability if is large compared to .
5.5 Proof of Theorem 3
Let denote any positive integer. Let . For , we write for the vector in whose values outside have been set to 0.
This notation is also extended to matrices. Given a positive integer and a matrix , we write for the matrix defined by . For , we also write for the -dimensional matrix such that .
5.5.1 Proof of Theorem 3
Let and consider any subset satisfying the property ().
Lemma 9.
There exists a constant such that the following holds for all . If
[TABLE]
there exists an event of probability higher than such that
[TABLE]
where and respectively refer to the smallest and largest eigenvalue of a matrix restricted to its coordinates in .
Thus, on the event defined above, the matrix restricted to its coordinates in is non-singular. Recall that the matrix is 0 outside . Nevertheless, we can define its pseudo-inverse by considering its inverse when restricted to and fixing all its remaining entries to 0. The restricted least-squares estimator is then conditionally distributed as follows
[TABLE]
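The pseudo-inverse construction described above (invert the Gram matrix on the selected coordinates, set every other entry to 0) can be sketched in Python with NumPy. This is a minimal illustration under assumptions: the function names `restricted_pinv` and `restricted_least_squares`, the toy dimensions, and the absence of any rescaling of the design are choices made here and need not match the normalization used in the paper.

```python
import numpy as np


def restricted_pinv(gram, support):
    """Pseudo-inverse of a Gram matrix that vanishes outside `support`:
    the inverse on the support block, zeros everywhere else."""
    p = gram.shape[0]
    idx = np.asarray(sorted(support))
    out = np.zeros((p, p))
    out[np.ix_(idx, idx)] = np.linalg.inv(gram[np.ix_(idx, idx)])
    return out


def restricted_least_squares(X, y, support):
    """Least-squares estimator using only the columns of X in `support`;
    coordinates outside the support are exactly zero."""
    idx = np.asarray(sorted(support))
    X_S = np.zeros_like(X)
    X_S[:, idx] = X[:, idx]          # design with the other columns zeroed
    gram_S = X_S.T @ X_S             # zero outside the support block
    return restricted_pinv(gram_S, support) @ X_S.T @ y


# Toy data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n, p = 40, 8
support = [1, 4, 6]
X = rng.standard_normal((n, p))
theta = np.zeros(p)
theta[support] = [1.0, -2.0, 0.5]
y = X @ theta + 0.1 * rng.standard_normal(n)

theta_hat = restricted_least_squares(X, y, support)
print(np.round(theta_hat, 3))
```

On the support, the result coincides with ordinary least squares on the submatrix of selected columns, which is what makes the conditional Gaussian distribution of the estimator explicit.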
Define the bias . On the event , it follows from the definition in Equation (25) of and Lemma 9 that
[TABLE]
Next, since follows a normal distribution (55), we can easily bound its deviations. In particular, we deduce from (53) that there exists an event of probability higher than such that on , one has
[TABLE]
Lemma 10.
Assume that . There exists an event of probability higher than such that on , we have
[TABLE]
Putting everything together, we derive that, under , one has
[TABLE]
This implies that, for all ,
[TABLE]
Under the null hypothesis. Suppose that . Note that (27) implies that
[TABLE]
Assume that belongs to . From (58), we deduce that, conditionally on the event , one has
[TABLE]
As a consequence, the test accepts the null hypothesis under the event .
Under the alternative hypothesis. We now assume that belongs to and satisfies
[TABLE]
Consider the set $T=\big\{i,\ |\theta^{*}_{i}|\geq\sigma\underline{t}\sqrt{\tfrac{\log(p)}{2m}}\big\}$ of large coordinates of . In view of , we have
[TABLE]
On the event , it follows from (58) and the definition of that if
[TABLE]
Observe that . Denoting , we obtain that . We can bound in terms of the bias and then use (56) and (59).
[TABLE]
where the inequality is a consequence of (59) and . In view of Equation (60), we have , which implies and therefore . The test therefore rejects the null hypothesis under the event , which concludes the proof.
Proof of Lemma 9.
We first show (53). Recall that is independent of and that the restriction of to follows a standard Wishart distribution, all coordinates outside being 0. By , the size of the corresponding covariance matrix is less than . From e.g. [21], we deduce that, on an event of probability larger than , we have
[TABLE]
where is a universal constant. Assuming that is small compared to , we deduce that the spectrum of lies in . Thus, under , the spectrum of restricted to its coordinates lies in .
Turning to (54), we observe that follows a mean-zero normal distribution. Using a deviation inequality for distribution (Lemma 3), we deduce the existence of an event of probability higher than such that , since is large enough as assumed in Theorem 3. Thus, from Equation (53), we deduce that, on , we have
[TABLE]
∎
Proof of Lemma 10.
follows a distribution with degrees of freedom. Using a deviation inequality for distribution (Lemma 3), we derive that , with probability higher than . Thus, it remains to bound . From the definition of the property , we deduce that
[TABLE]
∎
5.5.2 Proof of Proposition 8
Write for a stationary point of the MCP criterion and let denote its support. Since we used the normalized design, we are more interested in the rescaled estimator defined by . As explained in the proof of Lemma 10, the design matrix satisfies, with probability higher than , the compatibility property (see [36, 42]) with any set of size less than , see [40]. Besides, the restricted eigenvalues for sparsities of size less than are bounded by some constants depending on , see [21, 49], with probability higher than . From Lemma 1, we deduce that with probability higher than . We are therefore in position to apply Theorem 6 in [49] and Corollary 1 in Feng and Zhang [24] (footnote: our definition of MCP uses a different normalization from that in [49] and [24], so their results have to be translated into our setting), provided that we choose the constant large enough and small enough. From Theorem 6 in [49] with (the support of ), we deduce that, with probability higher than ,
[TABLE]
Write for the least-squares estimator of restricted to : as defined in Equation (26). We deduce from Corollary 1 in [24] that
[TABLE]
The restricted least-squares estimator follows a normal distribution with mean and covariance , where we consider here the pseudo-inverse. The eigenvalues of are bounded by the restricted eigenvalue condition on the design . Hence, we obtain with probability higher than , for some . This implies that if . We obtain
[TABLE]
The result follows.
5.5.3 Proof of Theorem 4
To alleviate the notation, we simply write for in this proof. For a random vector , we write for the conditional variance of given . Standard computations for conditional variance based on Schur complement lead to
[TABLE]
where is the pseudo-inverse of obtained by considering its inverse when restricted to and setting all its remaining entries to 0.
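The Schur complement computation behind this conditional variance can be checked numerically. For a centered Gaussian vector with covariance Sigma, the covariance of the coordinates in R given those in C is Sigma_RR - Sigma_RC Sigma_CC^{-1} Sigma_CR, which also equals the inverse of the (R, R) block of the precision matrix. The sketch below uses NumPy; the function name `conditional_covariance`, the random covariance matrix, and the conditioning set are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def conditional_covariance(sigma, cond):
    """Covariance of the remaining coordinates given the coordinates in
    `cond`, for a centered Gaussian vector with covariance `sigma`.
    Returns the indices of the remaining coordinates and the Schur
    complement Sigma_RR - Sigma_RC Sigma_CC^{-1} Sigma_CR."""
    p = sigma.shape[0]
    cond_set = set(cond)
    cond_idx = np.asarray(sorted(cond_set))
    rest_idx = np.asarray([i for i in range(p) if i not in cond_set])
    s_rr = sigma[np.ix_(rest_idx, rest_idx)]
    s_rc = sigma[np.ix_(rest_idx, cond_idx)]
    s_cc = sigma[np.ix_(cond_idx, cond_idx)]
    return rest_idx, s_rr - s_rc @ np.linalg.solve(s_cc, s_rc.T)


# A random, well-conditioned covariance matrix (illustrative only).
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
sigma = A @ A.T + 6.0 * np.eye(6)

rest, cond_cov = conditional_covariance(sigma, [0, 3])
precision = np.linalg.inv(sigma)
# Cross-check: the Schur complement is the inverse of the precision block.
check = np.linalg.inv(precision[np.ix_(rest, rest)])
print(np.max(np.abs(cond_cov - check)))
```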
In the sequel, we denote and . The following lemma ensures that the linear regression of on involves the restriction of to .
Lemma 11.
Fix any and consider the event such that . Then, given , the rows of are independent and follow a centered normal distribution with covariance matrix . Besides, we have
[TABLE]
The next lemma ensures that the population covariance matrix of the projected design still belongs to .
Lemma 12.
For any and any set , the restriction of to belongs to .
Denote . For , Property () is said to be satisfied if there exists an event measurable with respect to of probability higher than such that the three following inequalities hold:
[TABLE]
Assume that the property holds and recall that . Since is an integer, , and , there exists an event of probability larger than such that
[TABLE]
which, with , is the result of Theorem 4. Thus, it suffices to prove by induction.
Lemma 13.
Assume that with . Assume that is such that
[TABLE]
(Recall that is introduced in Lemma 1). Then, given , there exists an event measurable with respect to of probability higher than such that
[TABLE]
Step :
Recall that . By Lemma 13 and Equation (30), there exists an event with probability higher than such that
[TABLE]
Counting the components of that are larger (in absolute value) than , we derive that
[TABLE]
which, together with the previous bound implies
[TABLE]
So holds.
Induction step:
Assume that holds for some . By and on , we have that by Condition (30). Thus, is large enough and we can apply Lemma 13. As a consequence, there exists an event of probability higher than such that
[TABLE]
Together with , this implies . As for the proof of , we lower bound by considering separately the entries larger (in absolute value) than . This leads us to
[TABLE]
where we used in the second line. We have proved . This concludes the proof.
Proof of Lemma 11.
To alleviate the notation, we simply write for , for , (resp. ) for (resp. ), for the expectation , and for (in the proof of this lemma only). Besides, since has been built based on independent samples, we consider it as fixed. Also, without loss of generality, we assume that .
Define . Since follows a normal distribution, is independent of . Besides, the rows of are i.i.d. according to a centered normal distribution with covariance matrix . Since the rows of are i.i.d., each column of is a linear combination of the columns of . As a consequence, there exists a matrix such that .
Since and since is invertible, the rank of equals almost surely. As a consequence, applying the orthogonal projection along to leads to
[TABLE]
Since the rows of are i.i.d. with covariance , there exists a matrix with i.i.d. standard normal entries such that where is a square root of . As a consequence, . Since is independent of it follows that, given , the matrix is made of independent standard normal entries and the rows of therefore follow independent normal distributions with covariance matrix .
Also we have
[TABLE]
since the columns of in are equal to zero. Given , is the projection of a standard normal vector onto a subspace of dimension . As a consequence, follows a normal distribution with covariance matrix and is independent of . The result follows. ∎
Proof of Lemma 12.
For simplicity, we write for . Let be a normed vector supported in . We shall prove that belongs to . Consider a random vector so that . Consider the size covariance matrix of . Then, and , which therefore lies in . ∎
Proof of Lemma 13.
To alleviate the notation, we simply write and for $\widehat{\theta}_{(SL)}\big[\underline{Y}_{\perp}^{(t+1)},\underline{\mathbf{X}}_{\perp}^{(t+1)}\big]$ and $\widehat{\theta}_{(SL,t)}\big[\underline{Y}_{\perp}^{(t+1)},\underline{\mathbf{X}}_{\perp}^{(t+1)}\big]$ respectively. Recall that . The rows of corresponding to indices in are null. Therefore, $\widehat{\theta}_{(SL)}\big[\underline{Y}_{\perp}^{(t+1)},\underline{\mathbf{X}}_{\perp}^{(t+1)}\big]$ is a square-root Lasso estimator of given the restriction of to the rows in . In view of Lemmas 11 and 12, we can apply Lemma 1. Thus, given , there exists an event of probability higher than such that
[TABLE]
By assumption, . Since is a hard thresholded modification of at level
[TABLE]
its entry-wise error increases only at the non-zero entries of and at most by . This implies that
[TABLE]
Recall that is the support of . Each non-zero entry of is equal to that of . As a consequence, each index in the support of and outside the support of contributes at least by in the loss . This implies
[TABLE]
which in view of (63) leads us to and
[TABLE]
which concludes the proof.
∎
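The key step in the proof of Lemma 13 is that hard thresholding an estimator at a given level can increase its entry-wise error only at the entries it sets to zero, and by at most that level. The following Python sketch illustrates this on made-up numbers. The vectors `theta_star` and `theta_hat` and the value of `level` are hypothetical, and the actual threshold used in the paper is the one defined in the proof, which is not reproduced here.

```python
def hard_threshold(theta_hat, level):
    """Keep the entries whose absolute value exceeds `level`, zero the rest."""
    return [v if abs(v) > level else 0.0 for v in theta_hat]


# Hypothetical target and preliminary estimate (illustration only).
theta_star = [3.0, 0.0, -1.2, 0.0, 0.4]
theta_hat = [2.8, 0.15, -1.0, -0.05, 0.1]
level = 0.5

theta_thr = hard_threshold(theta_hat, level)

err_before = [abs(a - b) for a, b in zip(theta_hat, theta_star)]
err_after = [abs(a - b) for a, b in zip(theta_thr, theta_star)]

# For a zeroed entry, the triangle inequality gives
# |0 - theta*_i| <= |theta_hat_i - theta*_i| + |theta_hat_i|
#               <= err_before_i + level;
# for a kept entry the error is unchanged.
for i, (b, a) in enumerate(zip(err_before, err_after)):
    assert a <= b + level, f"entry {i} violates the bound"

print(theta_thr)
```

In this toy example the thresholded vector has support {0, 2}, so the small coordinate 0.4 of the target is missed, at a cost no larger than the error before thresholding plus the level.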
6 Proofs of the minimax lower bounds
We first state the following classical lemma that links the total variation distance with the performance of a test with composite hypotheses. Some variants of it may be found in textbooks such as [43]. For the sake of completeness, we provide a proof below.
Lemma 14.
Consider a parametric model and two subsets . Let and be any probability measures on . Denote for . Any test of against satisfies
[TABLE]
Proof of Lemma 14.
For , define the probability measure by for any event . Given , let . It follows from Le Cam’s arguments that
[TABLE]
By the triangle inequality, one has
[TABLE]
Obviously, the total variation distance equals .
[TABLE]
Arguing similarly for and plugging these bounds into (65) concludes the proof.
∎
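The Le Cam argument underlying Lemma 14 can be illustrated in its simplest form, with two simple hypotheses on a finite sample space: for any test, the sum of the type I and type II error probabilities is at least one minus the total variation distance, with equality for the likelihood-ratio test. The Python sketch below enumerates all deterministic tests on a four-point space. The distributions `p0` and `p1` are made-up numbers, and the composite-hypothesis version with priors stated in the lemma is not reproduced.

```python
from itertools import product


def total_variation(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def risk_sum(reject, p0, p1):
    """Type I error under p0 plus type II error under p1 for the
    deterministic test that rejects exactly on the points flagged True."""
    type1 = sum(a for a, r in zip(p0, reject) if r)
    type2 = sum(b for b, r in zip(p1, reject) if not r)
    return type1 + type2


# Two hypothetical distributions on a four-point sample space.
p0 = [0.5, 0.3, 0.15, 0.05]
p1 = [0.1, 0.2, 0.3, 0.4]

tv = total_variation(p0, p1)
# Enumerate all 2^4 deterministic tests and keep the smallest risk sum.
best = min(risk_sum(r, p0, p1) for r in product([False, True], repeat=4))

print(f"TV = {tv:.3f}, smallest risk sum = {best:.3f}, 1 - TV = {1 - tv:.3f}")
```

A lower bound on testing risk therefore reduces to an upper bound on a total variation distance between suitably chosen mixtures, which is how the lemma is used in the proofs that follow.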
6.1 Proof of Proposition 1
Proof of Proposition 1.
Intuitively, testing the sparsity for is impossible because cannot even be recovered in the noiseless setting () when it contains more than non-zero entries. As the design matrix is random, this argument needs to be slightly refined. Without loss of generality, we consider the case , and . Let us write for the submatrix of made of its first columns. In order to apply Lemma 14, we shall build two suitable prior distributions on the set of and sparse vectors.
With probability one, the square matrix is invertible. Also denote by (resp. ) the smallest (resp. largest) singular value of . Fix any . As stated for instance in [41], there exist , such that the following holds
[TABLE]
where stands for the distribution of . Here, stands for the -th column of . Although the exact expression of and is not relevant in this proof, these two quantities are of the order and . We call the event defined in the above probability bound.
Let stand for the centered Gaussian measure in with covariance matrix $\begin{pmatrix}\mathbf{I}_{n}&0\\ 0&0\end{pmatrix}$. We write . Given any , define the vector . We fix . We argue that, for small enough, the total variation distance is smaller than .
Under , for a fixed , it holds that whereas, under , it holds that . When satisfies , these two covariance matrices are invertible with eigenvalues in and . Thus, for going to zero, the total variation distance between these conditional distributions goes to zero uniformly over all satisfying . In particular, there exists some such that these distances are uniformly smaller than . Since , it follows that
[TABLE]
Consider whose value will be fixed later. Define and the distributions associated to the linear regression models. By contraction properties of the total variation distances, one has
[TABLE]
When is sampled according to , then the smallest (in absolute value) entry of among the first entries is larger than some positive quantity , with probability larger than . Let us call the corresponding event. Define as the measure conditioned to the event , i.e. for any measurable event . Then, we introduce . By the triangle inequality, we obtain
[TABLE]
When is sampled according to , satisfies . As a consequence of Lemma 14, any test of versus has a risk higher than . We have
[TABLE]
where does not depend on . Taking arbitrarily small leads to the desired result.
∎
6.2 Proof of Theorem 1
Given integers and , and , we define the collection
[TABLE]
We start by a simple reduction result to narrow the range of parameters. Its proof is postponed to the end of the section.
Lemma 15.
For any , we have
[TABLE]
For the sake of the following bound, we make explicit the dependency of with respect to by denoting it . For any and , we have
[TABLE]
In other words, the minimax separation distance is nondecreasing with respect to and, up to a change in the number of covariates, it is also nondecreasing with respect to . Next, we state three lemmas whose combination implies Theorem 1.
Lemma 16.
Assume that . There exists a numerical constant such that
[TABLE]
for any , all and .
Proof of Lemma 16.
This lemma is a consequence of known signal detection lower bounds (). For instance, it is proved in [46, Sect. 9.1] that
[TABLE]
for all . Since and , Lemma 15 entails that
[TABLE]
which concludes the proof. ∎
Lemma 17.
Assume that . There exist constants – such that the following holds for all :
[TABLE]
for all and . Furthermore, if and and , then
[TABLE]
Proof of Lemma 17.
In the above lemma, the minimax lower bounds both depend on the size of the null hypothesis and on the size of the alternative hypothesis. As a consequence, we can no longer rely directly on signal detection results as in the previous lemma. Nevertheless, we will introduce a third-party hypothesis and make use of previous signal detection lower bounds for unknown [45, 46].
By Lemma 15, we assume without loss of generality that . Given and , we define as the uniform measure over the set
[TABLE]
and the mixture measure . As a way to derive minimax lower bounds for signal detection with unknown noise level, it is proved in [45, Theorem 4.3] and [46, Lemma 9.3] (footnote: the results in [45, 46] are expressed in terms of minimax separation distance, the total variation distance control being stated in their respective proofs) that if or if with
[TABLE]
Let us now deduce (70). Since is increasing with respect to , we have
[TABLE]
Under , is -sparse, whereas under , is -sparse and its square distance to is . From Lemma 14, we deduce that, for , one has
[TABLE]
which enforces (70) since we have . Turning to (71), we observe that, under the assumptions of the lemma (and with a suitable choice of ), $k\log\big(\tfrac{\sqrt{p}}{e^{3/2}k}\big)\geq n$ both for and . Arguing as above, we deduce that
[TABLE]
since the expression inside the exponential is bounded away from zero and since for . We have proved (71). ∎
The following lemma provides the key new lower bound. It corresponds to the regime where both and are large. Its proof relies on more advanced arguments than the other regimes.
Lemma 18.
There exist positive numerical constants and such that the following holds for any and all . For any and , one has
[TABLE]
Proof of Theorem 1.
First we prove (6). The case is a consequence of Lemmas 16 and 17. As for the case , we divide the analysis into several subcases. If , it follows from Lemma 17 that is at least of the order of which is larger than the lower bound in (6). For we rely on Lemma 18. For , we define . From the reduction (68) and Lemma 18, we derive that
[TABLE]
Finally, the lower bound (7) is a consequence of the second part of Lemma 17 together with the reduction Lemma 15.
∎
Proof of Lemma 15.
The first bound is a simple consequence of the inclusion . Let us turn to (68). Take any arbitrarily small and define . There exists a test satisfying . For any linear regression problem with covariates and response , we sample new independent covariates, write the corresponding new design matrix of size , and define where is the constant vector of size . Since , we have
[TABLE]
implying that . Taking the infimum over all , we obtain (68). ∎
Proof of Lemma 18.
Without loss of generality we assume that the noise level is equal to one and we write for . Since the minimax separation distance is a nondecreasing function of , we have for any . In view of (72) and since for any , we only need to prove (72) for .
Define and . We introduce two priors and that are almost supported on and respectively and such that the first moments of and are matching. In Step 3 below, we show that this moment matching property ensures that the corresponding mixture distributions of are close in total variation distance.
Step 1. Construction of the priors.
As in [17], we build prior measures and in such a way that their first moments are matching. Define the two quantities (where is redefined only in this proof) as follows
[TABLE]
for some universal constant whose value will be fixed later. The following result is borrowed from [17, Lemma 3].
Lemma 19.
Given any positive and even integer and , define
[TABLE]
There exist two positive and symmetric measures and whose supports lie in satisfying:
[TABLE]
Fix . Then, given , we consider the measures and as in Lemma 19. Given any measurable event , we define and by
[TABLE]
Here, stands for and is the Dirac measure at 0. In view of this definition, the first moments of and are matching.
Step 2. Properties of the priors.
We consider the prior measures and . In view of Lemma 14, we need to show that is concentrated on and that is concentrated on for some large .
Under , follows a binomial distribution with parameter . By Chebyshev’s inequality,
[TABLE]
since . Similarly,
[TABLE]
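The Chebyshev step can be made concrete: for a binomial count B with n trials and success probability q, the probability that B deviates from nq by at least t is at most nq(1 - q)/t^2. The Python sketch below compares this bound with the exact two-sided tail. The values `n_coords = 200` and `q = 0.1` are hypothetical, since the binomial parameters used in the proof are not recoverable from the text above.

```python
import math


def binom_pmf(n, q, j):
    """Probability that a Binomial(n, q) variable equals j."""
    return math.comb(n, j) * q**j * (1.0 - q) ** (n - j)


def two_sided_tail(n, q, t):
    """Exact P(|B - n q| >= t) for B ~ Binomial(n, q)."""
    mean = n * q
    return sum(binom_pmf(n, q, j) for j in range(n + 1) if abs(j - mean) >= t)


def chebyshev_bound(n, q, t):
    """Chebyshev bound Var(B) / t^2 = n q (1 - q) / t^2, capped at 1."""
    return min(1.0, n * q * (1.0 - q) / t**2)


# Hypothetical parameters: number of coordinates and inclusion probability.
n_coords, q = 200, 0.1
deviations = (5, 10, 15)
exact = {t: two_sided_tail(n_coords, q, t) for t in deviations}
bound = {t: chebyshev_bound(n_coords, q, t) for t in deviations}

for t in deviations:
    print(f"t={t}: exact tail {exact[t]:.4f} <= Chebyshev bound {bound[t]:.4f}")
```

Chebyshev's inequality is crude compared with exponential bounds, but it suffices here because the proof only needs the sparsity of the random vector to be controlled with constant probability.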
Under the event , the corresponding parameter satisfies
[TABLE]
Since for and since for any , we deduce that
[TABLE]
As a consequence, with probability larger than , belongs to with . To apply Lemma 14, it remains to bound the total variation distance between
[TABLE]
Step 3. Control of .
For , define the distribution with and . By the triangle inequality, one has
[TABLE]
This upper bound greatly simplifies the following computations as the distributions and only differ by one coordinate. Unfortunately, we conjecture that our minimax lower bound in Theorem 1 is suboptimal in the regime where is close to precisely because of the upper bound (81). In the arguably simpler Gaussian sequence model [17], we have directly computed the distances between the corresponding distributions and to obtain the sharp separation distance in all regimes. If we use instead the decomposition (81) for the Gaussian sequence model, this leads to a suboptimal lower bound for close to . To close this gap in the linear regression model, one would therefore need to directly handle the distance between and but we were not able to do it.
In the following, we shall bound independently each of these distances . Interestingly, and only differ by the distribution of the -th coordinate of . The general idea is to condition with respect to all the coordinates except the -th one so that we consider a linear regression model with only one covariate.
Let us write the conditional density of given under .
[TABLE]
Writing down the expectation with respect to , we have
[TABLE]
by permutation invariance. We call this last quantity.
Given a -dimensional vector , let be such that and for all . Write and .
[TABLE]
where the quantities and are defined by
[TABLE]
Let be the event such that .
Fix any such that . Then, under , follows a distribution with degrees of freedom. As a consequence of deviation inequalities for distributions (Lemma 3), its probability to be larger than is smaller than . Besides, conditionally on , follows a normal distribution with mean and variance . As a consequence, under the event , the probability that is smaller than . In view of the definition (73) of and by taking the constant in that definition small enough, we conclude that
[TABLE]
for all such that . We set
[TABLE]
It follows from this definition and from Equation (82) that
[TABLE]
In order to work out the term , we rely on the power expansion of together with the nullity of the first moments of .
[TABLE]
since, by (73), if we fix . Plugging this bound into the definition of , we obtain
[TABLE]
by definition (73) of . Together with (83), this implies that . Then, we use the definition of and (81) to conclude that
[TABLE]
which is smaller than for large enough since the assumptions of Lemma 18 enforce that . In view of the above bound, (78), (79), and (79), we are in position to apply Lemma 14. Thus, for large enough, we conclude that
[TABLE]
∎
6.3 Proof of Proposition 6
Since , the second part of (21) comes from Theorem 1. Turning to the first part of (21), we have already pointed out in the proof of Lemma 17 that it is proved in [45] and [46] that, for all one has
[TABLE]
These two bounds imply that, for large enough and for all , one has
[TABLE]
Since is nondecreasing with respect to (Lemma 15), the above bound is also valid for all at the price of worse constants. Finally, we apply Lemma 15 together with the assumption to obtain the first part of (21).
Acknowledgements.
The work of A. Carpentier is partially supported by the Deutsche Forschungsgemeinschaft (DFG) Emmy Noether grant MuSyAD (CA 1488/1-1), by the DFG - 314838170, GRK 2297 MathCoRe, by the DFG GRK 2433 DAEDALUS, by the DFG CRC 1294 ’Data Assimilation’, Project A03, and by the UFA-DFH through the French-German Doktorandenkolleg CDFA 01-18. The authors thank anonymous reviewers for their helpful suggestions that improved the manuscript. The authors are also grateful to Alexandre Tsybakov and Cun-Hui Zhang for bringing to our knowledge some recent work on MCP.
References
- [1] E. Arias-Castro, E. Candes, and Y. Plan. Global Testing under Sparse Alternatives: ANOVA, Multiple Comparisons and the Higher Criticism. Annals of Statistics, 39:2533–2556, 2011.
- [2] Yannick Baraud. Non-asymptotic minimax rates of testing in signal detection. Bernoulli, 8(5):577–606, 2002.
- [3] Yannick Baraud, Sylvie Huet, and Béatrice Laurent. Testing convex hypotheses on the mean of a Gaussian vector. Application to testing qualitative hypotheses on a regression function. Annals of Statistics, pages 214–257, 2005.
- [4] A. Belloni, V. Chernozhukov, and L. Wang. Square-root Lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
- [5] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities. Oxford University Press, Oxford, 2013. A nonasymptotic theory of independence, With a foreword by Michel Ledoux.
- [6] Jelena Bradic, Jianqing Fan, and Yinchu Zhu. Testability of high-dimensional linear models with non-sparse structures. arXiv preprint arXiv:1802.09117, 2018.
- [7] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
- [8] T Tony Cai and Zijian Guo. Accuracy assessment for high-dimensional linear regression. arXiv preprint arXiv:1603.03474, 2016.
