Average case analysis of Lasso under ultra-sparse conditions
Koki Okajima, Xiangming Meng, Takashi Takahashi, Yoshiyuki Kabashima

TL;DR
This paper provides an average-case analysis of Lasso in ultra-sparse linear models using a novel replica method approach, offering insights into support recovery and performance without restrictive assumptions.
Contribution
It introduces a new analytical framework for Lasso's performance in ultra-sparse regimes, extending previous results to more general settings and noise conditions.
Findings
Provides a lower bound on sample complexity for support recovery
Generalizes previous bounds to non-Gaussian noise
Supports analysis with extensive numerical experiments
Department of Physics, The University of Tokyo
Abstract
We analyze the performance of the least absolute shrinkage and selection operator (Lasso) for the linear model when the number of regressors $N$ grows larger keeping the true support size $d$ finite, i.e., the ultra-sparse case. The result is based on a novel treatment of the non-rigorous replica method in statistical physics, which has been applied only to problem settings where $N$, $d$ and the number of observations $M$ tend to infinity at the same rate.
Our analysis makes it possible to assess the average performance of Lasso with Gaussian sensing matrices without assumptions on the scaling of $N$ and $M$, the noise distribution, and the profile of the true signal. Under mild conditions on the noise distribution, the analysis also offers a lower bound on the sample complexity necessary for partial and perfect support recovery when $M$ diverges as $M = O(\log N)$. The obtained bound for perfect support recovery is a generalization of that given in previous literature, which only considers the case of Gaussian noise and diverging $d$. Extensive numerical experiments strongly support our analysis.
1 Introduction
An important objective of high-dimensional statistics is to extract information in situations where the signal's dimension $N$ is overwhelmingly large compared to the accumulated sample size $M$. It is crucial to incorporate prior knowledge on the signal structure to reduce the signal space dimension for reliable estimation. A particularly common assumption is sparsity, which postulates that the true signal has few nonzero entries. Exploiting this property allows one to obtain robust and interpretable results specifying the few relevant variables explaining the retrieved data (Donoho, 2006).
For instance, consider the sparse linear regression problem where measurements of the signal with non-zero components are given by the linear model
[TABLE]
where is the sensing matrix, and is the noise vector distributed according to . The most fundamental yet popular sparse signal estimation method is the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996), which offers the estimator by solving the following convex program:
[TABLE]
where is a regularization parameter. Since its introduction, this simple $\ell_1$-regularization scheme has been successfully adopted as a backbone technique for solving a wide variety of sparse estimation problems. A particularly interesting question is whether one can make any guarantees on the performance of Lasso under general scalings of , its dependence on , and statistical properties of the noise and true signal.
A large body of research has been devoted to assessing the performance of Lasso. Traditionally, research based on the irrepresentability condition (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006) has been popular in establishing guarantees in terms of support recovery of the sparse signal (Wainwright, 2009b; Dossal et al., 2012; Meinshausen and Bühlmann, 2006; Zhang and Huang, 2008; Candès and Plan, 2009; Zhao and Yu, 2006). A different approach based on approximate message-passing (AMP) theory (Donoho et al., 2009) and the heuristic replica method (Mézard et al., 1986) from statistical physics has focused on assessing the sharp, asymptotic properties of Lasso in the large $N$ and $M$ limit under random sensing matrix designs. Despite these works, the understanding of the Lasso estimator is still limited. Analysis based on the irrepresentability condition often offers only scaling guarantees with respect to , or statements with strong assumptions on the regularization parameter. Moreover, the AMP/replica-based analysis has been limited to linear sparsity, i.e. and as , which may be somewhat unrealistic compared to real-world situations.
1.1 Contributions
In this work, we address the drawbacks of both the irrepresentability condition approach and the AMP/replica approach by theoretically analyzing the average performance of Lasso when $d = O(1)$, i.e. the ultra-sparse case (Donoho et al., 1992; Bhadra et al., 2017), which is a more typical situation in certain applications such as materials informatics (Ghiringhelli et al., 2015; Kim et al., 2016; Pilania et al., 2016). Moreover, our result offers a necessary condition for support recovery in the limit . Specifically, our contributions are summarized as follows:
- We provide a new way to apply the replica method in the ultra-sparsity regime. This is done by explicitly handling the correlations and finite-size effects acting on the active set, which are otherwise ignored in conventional analysis (Section 2.1, Claim 1).
- Using this enhanced replica method, we precisely evaluate the average property of Lasso under ultra-sparsity and standard Gaussian matrix design, i.e. each element of the sensing matrix is drawn i.i.d. from a standard Gaussian distribution. This provides an extension to previous results derived from the AMP theory and the replica method, where linear sparsity is necessary for the analysis (Section 2.2, Claim 2).
- We derive a necessary condition for partial support recovery under some mild conditions (Assumption 1). Specifically, the number of false positives, and subsequently the model misselection probability, vanishes only if for . This constant is determined by the mean prediction error of an oracle (Section 2.3, Claims 3 and 4).
- In addition to partial support recovery, the analysis also provides a necessary condition for perfect support recovery, which generalizes the sample complexity bound given by Wainwright (2009b) for i.i.d. Gaussian noise distributions in the limit $d \to \infty$ to more general noise distributions under constant $d$ (Section 2.3, Claim 5).
- We demonstrate that our theory agrees well with experiment by conducting extensive numerical simulations (Section 3).
Note that all of the results are derived from the enhanced replica method, which is yet to be proven rigorously; hence the statements are presented as claims.
1.2 Related Work
Irrepresentability Condition.
As mentioned above, the irrepresentability condition, first introduced by Meinshausen and Bühlmann (2006) and Zhao and Yu (2006), has been an important cornerstone, as it establishes a sufficient condition for perfect support recovery. This condition indicates whether the covariates, i.e. the columns of the sensing matrix, are linearly independent enough to be distinguishable from one another, so that variable selection is relatively feasible. It has been revealed that Lasso is an “optimal” support estimator in the sublinear regime , i.e. Lasso has its success/failure threshold for sample complexity in the same order as the information-theoretic one (Fletcher et al., 2009; Wainwright, 2009a). However, little is known about the constants involved in these conditions. Wainwright (2009b) provided necessary and sufficient conditions for perfect support recovery under random Gaussian matrices for diverging $d$. This is a simple and explicit bound which depends on the regularization parameter and the intensity of the noise, which is restricted to i.i.d. Gaussian. Focusing on the case , Dossal et al. (2012) derived sufficient conditions for partial and perfect support recovery under deterministic noise, whose bound is similar to the one given in Wainwright (2009b).
AMP theory.
A particular line of work has aimed at assessing the properties of Lasso under general random matrix designs via careful analysis of the dynamical behavior of the AMP algorithm (Kabashima, 2003; Donoho et al., 2009; Takahashi and Kabashima, 2022), whose convergence point coincides with (2) in the large $N$ limit. Rather than establishing inequality bounds or conditions, the objective is to establish sharp results on the Lasso for a random instance of . Although the analysis is limited to the linear sparsity regime, powerful and precise results have been proven rigorously under this framework (Bayati and Montanari, 2012). For instance, Su et al. (2017) and Wang et al. (2020) determine the possible rate of false positives and true positives achievable under certain settings, which can be obtained by solving a small set of nonlinear equations. Nevertheless, the analysis does not give insight on support recovery, since this is impossible in the linear sparsity regime (Fletcher et al., 2009; Wainwright, 2009a).
Replica method.
Results similar to those from AMP theory have also been derived by using the non-rigorous replica method in statistical mechanics. Unlike AMP theory, which is based on a convergence analysis of a particular algorithm, the replica method aims at directly calculating the average over of a cumulant generating function for some probability distribution, i.e. of the form . This calculation is often encountered in the field of statistics, where one is interested in the average behavior of a statistical model. While lacking a complete proof, this method has been successful in predicting the average performance of machine learning and optimization methods under general random designs in the linear sparsity regime (Vehkaperä et al., 2016; Zdeborová and Krzakala, 2016). In fact, under certain assumptions, the average predictions given by the replica method have been proven to be consistent with the asymptotic results obtained from AMP theory and other rigorous methods (Stojnic, 2013; Thrampoulidis et al., 2018). Similar to AMP theory, however, reliable adaptations of this method outside linear sparsity are still open problems. Previous research such as Abbara et al. (2020), Meng et al. (2021a) and Meng et al. (2021b) analyzed the performance of sparse Ising model selection using a variation of the replica method. However, this was accomplished through a series of ansatzes which are generally difficult to justify theoretically.
1.3 Preliminaries
Here we summarize the notations used in this paper. The expression denotes the norm. The active set is defined as the support of the $d$-sparse true signal , . Define , the size of the inactive set. The matrix denotes the submatrix constructed by concatenating the columns of with indices in . The vector denotes the subvector of with indices in . For simplicity, is assumed to be deterministic, although this can be extended to random signals trivially. The expression denotes the average over the joint probability with respect to the pair , i.e.
[TABLE]
where denotes the Dirac delta function. The definition of follows straightforwardly from the above. Also, define as the standard Gaussian measure, for . Given and regularization parameter , the oracle Lasso estimator is defined as , which is the Lasso estimator with the true support identified beforehand. It is also convenient to define the oracle Lasso fit, defined by , with its dependence on suppressed for convenience.
Given configuration , and regularization parameter , the number of false positives and the number of true positives of the Lasso estimator are defined as
[TABLE]
where denotes the complement of set from . Without confusion, the dependence on is suppressed for convenience.
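The two counts defined above can be illustrated with a minimal sketch. It assumes the objective $\frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$, which may be normalized differently from (2), and uses a plain coordinate descent solver. The names `lasso_cd` and `count_fp_tp`, the problem sizes, and the value of `lam` are all hypothetical choices made for illustration.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(A, y, lam, n_sweeps=100):
    """Coordinate descent for min_x 0.5*||y - A x||_2^2 + lam*||x||_1.

    The 0.5 / lam normalization is an assumption of this sketch.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    resid = list(y)  # running residual y - A x
    col_sq = [sum(A[m][j] ** 2 for m in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_sweeps):
        for j in range(n_cols):
            # Correlation of column j with the residual that excludes x_j.
            rho = sum(A[m][j] * resid[m] for m in range(n_rows)) + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_xj - x[j]
            if delta != 0.0:
                for m in range(n_rows):
                    resid[m] -= A[m][j] * delta
                x[j] = new_xj
    return x

def count_fp_tp(x_hat, support):
    """FP: nonzero entries outside the support; TP: nonzero entries inside it."""
    fp = sum(1 for j, v in enumerate(x_hat) if v != 0.0 and j not in support)
    tp = sum(1 for j, v in enumerate(x_hat) if v != 0.0 and j in support)
    return fp, tp

rng = random.Random(0)
M, N = 30, 60                       # hypothetical sample size and dimension
support = {0, 1, 2}                 # hypothetical active set
x0 = [1.0 if j in support else 0.0 for j in range(N)]
A = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
y = [sum(A[m][j] * x0[j] for j in support) + 0.1 * rng.gauss(0.0, 1.0)
     for m in range(M)]

x_hat = lasso_cd(A, y, lam=5.0)
fp, tp = count_fp_tp(x_hat, support)
print("FP =", fp, "TP =", tp)
```

Averaging `fp` and `tp` over many independent draws of the matrix and noise gives the kind of empirical average that the experiments in Section 3 compare with the replica predictions.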
We say that an event holds with asymptotically high probability (w.a.h.p.) if there exists a constant such that . We also say that holds with probability approaching one (w.p.a.1) if as .
2 Replica analysis
Define the Boltzmann distribution as
[TABLE]
where is the normalization constant. Note that in the limit , (5) converges to a point-wise distribution concentrated on the Lasso estimator . The main objective of our analysis is to calculate the average of the logarithm of over the random variables in the limit , which is called the free energy or the cumulant generating function
[TABLE]
The properties of averaged over the population of can then be assessed by taking appropriate derivatives of .
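The concentration of the Boltzmann distribution on the Lasso estimator can be checked in one dimension. The sketch below assumes the scalar energy $\frac{1}{2}(y-x)^2 + \lambda|x|$, whose minimizer is the soft-thresholded value of $y$, and evaluates the Boltzmann mean on a grid. The function names, grid settings and parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def soft_threshold(y, lam):
    """Minimizer of 0.5*(y - x)^2 + lam*|x| over x."""
    return math.copysign(max(abs(y) - lam, 0.0), y)

def boltzmann_mean(y, lam, beta, half_width=10.0, n_grid=20001):
    """Mean of p_beta(x) proportional to exp(-beta*(0.5*(y-x)^2 + lam*|x|)),
    approximated on a uniform grid over [-half_width, half_width]."""
    step = 2.0 * half_width / (n_grid - 1)
    xs = [-half_width + i * step for i in range(n_grid)]
    energies = [0.5 * (y - x) ** 2 + lam * abs(x) for x in xs]
    e_min = min(energies)
    # Shift by the minimum energy so that the largest weight is exactly 1.
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, xs)) / total

for beta in (1.0, 10.0, 100.0, 1000.0):
    print(beta, boltzmann_mean(2.0, 0.5, beta), soft_threshold(2.0, 0.5))
```

As `beta` grows, the printed mean approaches the soft-thresholded value, which is the scalar analogue of the limit described above.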
Although (6) is difficult to calculate straightforwardly, this can be resolved by using the replica method (Mézard and Montanari, 2009; Mézard et al., 1986), which is based on the following equality
[TABLE]
Instead of handling the cumbersome expression in (6) directly, one calculates the average of the $n$-th power of the normalization constant for $n \in \mathbb{N}$, analytically continues this expression to real $n$, and finally takes the limit $n \to 0$. Based on this replica “trick”, it suffices to calculate
[TABLE]
up to the first order of to take the limit in the right hand side of (7).
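The replica identity can be illustrated numerically on a toy partition function. The sketch below uses one common form of the identity, $\mathbb{E}\log Z = \lim_{n\to 0} n^{-1}\log \mathbb{E} Z^n$, which may differ in presentation from (7), and a log-normal $Z$ for which both sides are easy to estimate. The toy distribution and all names are assumptions made for illustration.

```python
import math
import random

def replica_estimate(log_z_samples, n):
    """Estimate (1/n) * log E[Z^n] from samples of log Z."""
    moment = sum(math.exp(n * lz) for lz in log_z_samples) / len(log_z_samples)
    return math.log(moment) / n

rng = random.Random(1)
# Toy model: log Z is Gaussian with mean 2 and unit variance.
log_z = [rng.gauss(2.0, 1.0) for _ in range(20000)]
direct = sum(log_z) / len(log_z)   # direct estimate of E[log Z]

for n in (1.0, 0.1, 0.01):
    print(n, replica_estimate(log_z, n), direct)
```

For this toy model $n^{-1}\log\mathbb{E}Z^n = 2 + n/2$, so the replica estimate approaches the direct average linearly in $n$. In the actual calculation the moments are only available for integer $n$, and the continuation to $n \to 0$ is the non-rigorous step.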
2.1 Outline of the derivation
Here, we only give a brief outline of the derivation; for details, see Supplementary Materials. Rewriting , it is convenient to introduce the auxiliary variable , which accounts for the effect from the variables not in the true support in each replica . A crucial observation is that is statistically independent from , , which allows the average to be taken individually.
By taking the average over the Gaussian variables first, we find that is Gaussian with zero mean and covariance . By assuming the replica symmetric (RS) ansatz (Mézard et al., 1986)
[TABLE]
the integral for the replicated vectors over the whole space is restricted to a subspace satisfying the constraints (9). More explicitly, one can rewrite (8) as
[TABLE]
where corresponds to the contribution from the RS constraint: i.e.
[TABLE]
and is the contribution from the second line of (8), albeit simplified as a result of replica symmetry:
[TABLE]
By using the Fourier representation of the delta function, (11) can be further rewritten as
[TABLE]
Using this expression, the integral with respect to in (10) can be calculated analytically. Performing the saddle point approximation for large to the integrals with respect to , and finally taking the limit after in (7) yields the following expression for .
Claim 1.
The free energy is given by
[TABLE]
Here, , is the complementary error function , and refers to the extremum condition with respect to , which are random variables dependent on .
Straightforward calculation shows that the extremum conditions are given by
[TABLE]
where the second equality in (2.1) is from Theorem 1 in Tibshirani and Taylor (2012). Note that the dependence of on is not explicitly written for the sake of simplicity. This evaluation of reduces the high-dimensional integral over and to an average over a four-dimensional extremum problem involving a $d$-dimensional integral with respect to , which can be numerically computed via iterative substitution and Monte Carlo sampling over and .
It is interesting to compare our replica analysis in the large and limit to the ones considering linear sparsity (Kabashima et al., 2009; Vehkaperä et al., 2016). In linear sparsity, the Lasso estimator's statistical property can effectively be described by a population of decoupled, independent scalar estimators under Gaussian noise with identical intensity as . This is often referred to as the decoupling principle in information theory; see Guo and Verdú (2005) and Bayati and Montanari (2011) for details. In the ultra-sparse case, the elements of the Lasso estimator in the active set, consisting of $d$ terms, cannot be expected to decouple, as finite-size effects of non-Gaussian and correlated nature are expected to be significant to describe its profile. This is why a $d$-body optimization procedure and the average with respect to appear explicitly in (14). On the other hand, the decoupling principle is implicitly employed for the non-active variables conditioned on . More explicitly, for each configuration of , each element of the non-active Lasso estimator is statistically equivalent to
[TABLE]
where are i.i.d. according to . Note that the decoupling principle, rigorously proven under AMP theory, does not necessarily need and to diverge at the same rate (Rush and Venkataramanan, 2018).
2.2 Performance assessment of Lasso
The free energy allows convenient evaluation of averages of certain functions of the estimator. More explicitly, for a function , its average with respect to the Boltzmann distribution (5) and is given by
[TABLE]
where
[TABLE]
For a class of functions , the above can be calculated trivially, which we state in the following claim:
Claim 2 (Average with respect to active and inactive sets).
For arbitrary functions and , we have
[TABLE]
and
[TABLE]
where , and is given by the solution of the extremum conditions (15)–(18) for each . In particular, performance measures such as the average of true positives (TP), false positives (FP) and error are given by
[TABLE]
2.3 Necessary condition for support recovery
A particular topic of interest is partial support recovery, and the minimum number of samples necessary for the false positives to vanish in the limit . Although the fixed point equations (15)–(18) do not admit a closed-form solution, a necessary condition in terms of the sample complexity can be derived under the following mild conditions:
Assumption 1.
- A: (Uniqueness of fixed point) The solutions of the fixed point equations (15)–(18) are unique and satisfy .
- B: (Concentration of the oracle Lasso estimator) The random variable
[TABLE]
has finite mean and variance converging to zero.
- C: (Bounded variance of noise distribution) The distribution satisfies
[TABLE]
for some constant .
Claim 3 (Necessary sample complexity for asymptotically zero false positives).
Let $M$ diverge with $N$ with scaling $M = O(\log N)$. Under Claim 1 and Assumption 1, if there exists a constant such that in the limit , then
[TABLE]
holds for any constant , where
The proof is postponed to Section 4. From this claim, the necessary sample complexity for partial support recovery follows immediately:
Claim 4 (Necessary sample complexity for partial support recovery).
Under the settings in Claim 3, if w.a.h.p., then .
By definition, is the prediction error of the oracle, which is given the sensing submatrix and observation vector . This is reminiscent of the primal-dual witness construction in Wainwright (2009b), where sufficient conditions for asymptotically zero FPs are derived by solving the oracle Lasso first, and observing whether the oracle solution concatenated with zero elements is a unique solution of the original Lasso problem (2).
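The oracle-then-check procedure described above can be sketched as follows. The sketch assumes the objective $\frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$, for which the zero-padded oracle solution satisfies the optimality conditions of the full problem whenever every inactive column $a_j$ obeys $|a_j^\top (y - A_S \hat{x}_S)| \le \lambda$. The normalization, the coordinate descent solver, and the names `lasso_cd` and `oracle_witness` are assumptions of this illustration.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(A, y, lam, n_sweeps=100):
    """Coordinate descent for min_x 0.5*||y - A x||_2^2 + lam*||x||_1 (assumed form)."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    resid = list(y)  # running residual y - A x
    col_sq = [sum(A[m][j] ** 2 for m in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_sweeps):
        for j in range(n_cols):
            rho = sum(A[m][j] * resid[m] for m in range(n_rows)) + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_xj - x[j]
            if delta != 0.0:
                for m in range(n_rows):
                    resid[m] -= A[m][j] * delta
                x[j] = new_xj
    return x

def oracle_witness(A, y, support, lam):
    """Solve the Lasso restricted to `support`, then test the inactive columns."""
    cols = sorted(support)
    A_S = [[row[j] for j in cols] for row in A]
    x_S = lasso_cd(A_S, y, lam)
    resid = [y_m - sum(a * x for a, x in zip(row, x_S)) for row, y_m in zip(A_S, y)]
    # Largest |a_j^T (y - A_S x_S)| over the inactive columns j.
    max_corr = max(
        abs(sum(row[j] * r for row, r in zip(A, resid)))
        for j in range(len(A[0])) if j not in support
    )
    return x_S, max_corr, max_corr <= lam

rng = random.Random(0)
M, N = 30, 60                      # hypothetical sizes
support = {0, 1, 2}                # hypothetical active set
A = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
y = [sum(A[m][j] for j in support) + 0.1 * rng.gauss(0.0, 1.0) for m in range(M)]

x_S, max_corr, witness_ok = oracle_witness(A, y, support, lam=5.0)
print("oracle estimate:", x_S)
print("max inactive correlation:", max_corr, "witness holds:", witness_ok)
```

When `witness_ok` is true, padding `x_S` with zeros yields a solution of the full problem with no false positives; uniqueness requires the inequality to hold strictly.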
Furthermore, the necessary condition for perfect support recovery can also be derived using Claim 3.
Claim 5 (Necessary sample complexity for perfect support recovery).
Under the settings in Claim 3, suppose holds w.a.h.p. Then
[TABLE]
holds for any constant , where .
Note that in the special case of Gaussian noise with variance , we have , which extends the result of Wainwright (2009b, Theorem 4) to the case . Moreover, our result can be applied to any noise distribution satisfying Assumption 1.C.
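For orientation, the Gaussian-noise threshold of Wainwright (2009b) for a standard Gaussian design is commonly quoted as $2k\log(p-k)\,(1 + \sigma^2/(\lambda_n^2 k))$ under the normalization $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_n\|\beta\|_1$. The sketch below simply evaluates that expression. The formula as written here, the normalization, and the function name are assumptions of this illustration, and they may not match the constants in Claim 5 term by term.

```python
import math

def wainwright_threshold(n_features, k, sigma2, lam):
    """Sample-size scale 2*k*log(p - k)*(1 + sigma^2/(lam^2*k)).

    Assumes a standard Gaussian design and the 1/(2n) normalization
    of the squared loss; both are assumptions of this sketch.
    """
    return 2.0 * k * math.log(n_features - k) * (1.0 + sigma2 / (lam ** 2 * k))

# Hypothetical settings: p = 1000 regressors, support size k = 3.
for sigma2 in (0.0, 0.1, 1.0):
    print(sigma2, wainwright_threshold(1000, 3, sigma2, lam=0.5))
```

The noise enters only through the ratio $\sigma^2/(\lambda_n^2 k)$, so a larger regularization parameter lowers the threshold in this expression.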
3 Numerical experiments
3.1 Non-asymptotic results
To verify the derived results based on Claim 1, numerical experiments were conducted. For simplicity, we consider the case where the active set has size with , and is generated from a Gaussian distribution with variance . Here, the value of is taken to be small enough such that finite-size effects are nonignorable. The values of and obtained from our replica predictions (24)-(26) are compared with the average over experimental runs. The average with respect to for obtaining the replica prediction was approximated using a Monte Carlo procedure over samples.
Figure 1 shows that all three values from theory and experiment are in good agreement for parameters and .
3.2 Asymptotic results
Claims 3 and 4 are also verified via numerical experiments; see Supplementary Materials for numerical experiments on Claim 5. In order to assess the critical point in (27), Monte Carlo experiments were conducted to evaluate for different values of . Figure 2 shows the value of at and for both and . From its asymptotic behavior, can be evaluated as the values given in Table 1. Interestingly, for the case , approaches and for and respectively, which is equivalent to given in Claim 5.
Figure 3 shows the average number of FP and partial support recovery probability over 10,000 experimental runs for in the vicinity of the numerically evaluated for different values of . We observe that for , the average FP is consistently nondecreasing with respect to , while partial support recovery probability is consistently nonincreasing with respect to .
4 Proofs
4.1 Proof of Claim 3
The following lemmas will be useful in the proof.
Lemma 1 (Lemma 1, Dossal et al. (2012)).
There is a finite increasing sequence with such that for all , the sign and support of are constant on each interval .
Lemma 2 (Lemma 1, Tibshirani and Taylor (2012)).
The Lasso fit is 1-Lipschitz continuous with respect to the $\ell_2$ norm.
Lemma 3 (Theorem II.13, Davidson and Szarek (2001)).
Let be a random matrix with i.i.d. standard Gaussian entries. The largest and smallest eigenvalues of satisfy
[TABLE]
for and
[TABLE]
for .
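A quick numerical sanity check of this type of bound is sketched below. It assumes the commonly quoted form of the Davidson–Szarek result, in which the singular values of an $M \times d$ standard Gaussian matrix lie in $[\sqrt{M} - \sqrt{d} - t,\ \sqrt{M} + \sqrt{d} + t]$ with probability at least $1 - 2e^{-t^2/2}$. The precise statement used in the lemma may be normalized differently. The sketch takes $d = 2$ so that the eigenvalues of the Gram matrix have a closed form; all names and parameter values are illustrative.

```python
import math
import random

def gram_eigenvalues_two_columns(A):
    """Eigenvalues of A^T A for a matrix A with exactly two columns."""
    a = sum(r[0] * r[0] for r in A)
    b = sum(r[0] * r[1] for r in A)
    c = sum(r[1] * r[1] for r in A)
    center = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return center - radius, center + radius

def singular_value_window(n_rows, n_cols, t):
    """Assumed window [sqrt(M) - sqrt(d) - t, sqrt(M) + sqrt(d) + t]."""
    lo = max(math.sqrt(n_rows) - math.sqrt(n_cols) - t, 0.0)
    hi = math.sqrt(n_rows) + math.sqrt(n_cols) + t
    return lo, hi

rng = random.Random(2)
M, t, trials = 100, 3.0, 200
lo, hi = singular_value_window(M, 2, t)
violations = 0
for _ in range(trials):
    A = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(M)]
    e_min, e_max = gram_eigenvalues_two_columns(A)
    # Eigenvalues of A^T A are squared singular values of A.
    if e_min < lo ** 2 or e_max > hi ** 2:
        violations += 1

print("empirical violation rate:", violations / trials)
print("bound on violation probability:", 2.0 * math.exp(-t * t / 2.0))
```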
We now prove Claim 3. Define
[TABLE]
Let us evaluate the difference between and when . Using the Cauchy–Schwarz inequality and symmetry ,
[TABLE]
The triangle inequality and Lemma 2 imply that
[TABLE]
and similarly,
[TABLE]
To derive a bound for the last term in (4.1) and (4.1), Lemma 1 is employed. Let the support and sign of be constant in intervals , where . Let the support set in interval be given by , and let be the sign vector of restricted to . From the KKT conditions, the Lasso fit is expressed as
[TABLE]
where denotes the pseudoinverse of matrix . We deduce
[TABLE]
Lemma 3, together with the inclusion principle, implies that w.a.h.p., . The relations (31)–(4.1), and inequality then lead to the following holding w.a.h.p.
[TABLE]
We now use the following lemma which shows that and are negligible almost surely.
Lemma 4.
Under the assumptions of Claim 3, and hold w.a.h.p.
The proof is given in Supplementary Materials. Since is bounded by w.p.a.1 from Lemma 3 and Assumption 1.C, the right hand side of eq. (4.1) is of w.p.a.1.
We therefore have
[TABLE]
On the other hand, the extremum conditions (16) and (2.1) imply that is always bounded.
Lemma 5.
Suppose the extremum conditions (15)–(18) are satisfied. Then, the variable satisfies
[TABLE]
Combined with (35), for sufficiently large
[TABLE]
holds for arbitrary constant . This implies that must be larger than the median of . Now, the difference between the median and average is no larger than one standard deviation, which is negligible from Assumption 1.B. This yields the statement of the claim in the limit .
4.2 Proof of Claim 4
From Theorem 6 in Osborne et al. (2000), the number of false positives is bounded by . Hence, we have for some . The statement of Claim 4 then follows from Claim 3.
4.3 Proof of Claim 5
From Claims 3 and 4, it suffices to show that
[TABLE]
The KKT conditions imply that w.p.a.1, where we abbreviated and . Therefore, can be decomposed into a sum of two linearly independent vectors
[TABLE]
where , and is the projection onto the kernel of . The average of the squared norm of can be evaluated as
[TABLE]
where the last inequality follows from Jensen's inequality and (Davidson and Szarek, 2001).
To obtain a lower bound on the squared norm of , fix the vector . Noticing that entries of are i.i.d. standard Gaussian, the tail bound for $\chi^2$ random variables (Laurent and Massart, 2000) implies that for some constant ,
[TABLE]
Using this inequality, (30) with and the union bound, we have that
[TABLE]
Equation (38) immediately follows from (40) and (4.3), which completes the proof.
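The $\chi^2$ tail bound invoked in this proof can be checked by simulation. The sketch below uses the standard Laurent–Massart form for $X \sim \chi^2_k$, namely $P(X - k \ge 2\sqrt{kx} + 2x) \le e^{-x}$ and $P(k - X \ge 2\sqrt{kx}) \le e^{-x}$. The exact constants used in the proof are not reproduced here, and the function names and parameters are illustrative assumptions.

```python
import math
import random

def chi2_sample(k, rng):
    """One draw of a chi-squared variable with k degrees of freedom."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

def tail_frequencies(k, x, n_samples, rng):
    """Empirical frequencies of the upper and lower Laurent-Massart events."""
    upper = k + 2.0 * math.sqrt(k * x) + 2.0 * x
    lower = k - 2.0 * math.sqrt(k * x)
    n_up = n_low = 0
    for _ in range(n_samples):
        s = chi2_sample(k, rng)
        if s >= upper:
            n_up += 1
        if s <= lower:
            n_low += 1
    return n_up / n_samples, n_low / n_samples

rng = random.Random(3)
k, x = 10, 2.0
freq_up, freq_low = tail_frequencies(k, x, 5000, rng)
print("upper tail:", freq_up, "lower tail:", freq_low, "bound:", math.exp(-x))
```

Both empirical frequencies should fall below $e^{-x}$, typically by a wide margin, since the bound is not tight for moderate $k$.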
5 Conclusion
In this paper, we provided an analysis based on an enhanced replica method for assessing the average performance of the Lasso estimator under ultra-sparse conditions. In addition, we derived necessary conditions for support recovery, which are expressed in terms of the oracle Lasso estimator. Numerical experiments strongly support the validity of our analysis.
The methodological novelty originates from an observation of finite-size effects and correlations within the active set, which is implicitly assumed to be negligible in the conventional replica analysis. We anticipate that this framework is applicable to analysis of other machine learning or optimization problems where finite-size effects are nonnegligible. Extending this method further to more general sensing matrix ensembles is also another exciting direction for future work.
Acknowledgements
This work was partially supported by JSPS KAKENHI Grant Nos. 22J21581 (KO), 21K21310 (TT), 17H00764, 19H01812, 22H05117 (YK) and JST CREST Grant Number JPMJCR1912 (YK).
References
- Abbara etĀ al., (2020)
Abbara, A., Kabashima, Y., Obuchi, T., and Xu, Y. (2020).
Learning performance in inverse Ising problems with sparse teacher couplings.
Journal of Statistical Mechanics: Theory and Experiment, 2020(7):073402.
- Bayati and Montanari, (2011)
Bayati, M. and Montanari, A. (2011).
The dynamics of message passing on dense graphs, with applications to compressed sensing.
IEEE Transactions on Information Theory, 57(2):764ā785.
- Bayati and Montanari, (2012)
Bayati, M. and Montanari, A. (2012).
The lasso risk for gaussian matrices.
IEEE Transactions on Information Theory, 58(4):1997ā2017.
- Bhadra etĀ al., (2017)
Bhadra, A., Datta, J., Polson, N.Ā G., and Willard, B. (2017).
The Horseshoe+ Estimator of Ultra-Sparse Signals.
Bayesian Analysis, 12(4):1105 ā 1131.
- CandĆØs and Plan, (2009)
CandĆØs, E.Ā J. and Plan, Y. (2009).
Near-ideal model selection by 1 minimization.
The Annals of Statistics, 37(5A):2145 ā 2177.
- Chang etĀ al., (2011)
Chang, S.-H., Cosman, P.Ā C., and Milstein, L.Ā B. (2011).
Chernoff-type bounds for the gaussian error function.
IEEE Transactions on Communications, 59(11):2939ā2944.
- Davidson and Szarek, (2001)
Davidson, K.Ā R. and Szarek, S.Ā J. (2001).
Chapter 8 - local operator theory, random matrices and banach spaces.
In Handbook of the Geometry of Banach Spaces, volumeĀ 1, pages 317ā366. Elsevier Science B.V.
- Donoho, (2006)
Donoho, D.Ā L. (2006).
Compressed sensing.
IEEE Transactions on Information Theory, 52(4):1289ā1306.
- Donoho etĀ al., (1992)
Donoho, D.Ā L., Johnson, I.Ā M., Hoch, J.Ā C., and Stern, A.Ā S. (1992).
Maximum Entropy and the Nearly Black Object.
Journal of the Royal Statistical Society. Series B (Methodological), 54(1):41 ā 81.
- Donoho etĀ al., (2009)
Donoho, D.Ā L., Maleki, A., and Montanari, A. (2009).
Message-passing algorithms for compressed sensing.
Proceedings of the National Academy of Sciences, 106(45):18914ā18919.
- Dossal etĀ al., (2012)
Dossal, C., Chabanol, M.-L., PeyrƩ, G., and Fadili, J. (2012).
Sharp support recovery from noisy random measurements by 1-minimization.
Applied and Computational Harmonic Analysis, 33(1):24ā43.
- Fletcher etĀ al., (2009)
Fletcher, A.Ā K., Rangan, S., and Goyal, V.Ā K. (2009).
Necessary and sufficient conditions for sparsity pattern recovery.
IEEE Transactions on Information Theory, 55(12):5758ā5772.
- Ghiringhelli etĀ al., (2015)
Ghiringhelli, L.Ā M., Vybiral, J., Levchenko, S.Ā V., Draxl, C., and Scheffler, M. (2015).
Big data of materials science: Critical role of the descriptor.
Physical Review Letters, 114:105503.
- Guo and VerdĆŗ, (2005)
Guo, D. and VerdĆŗ, S. (2005).
Randomly spread CDMA: asymptotics via statistical physics.
IEEE Transactions on Information Theory, 51(6):1983ā2010.
- Kabashima, (2003)
Kabashima, Y. (2003).
A CDMA multiuser detection algorithm on the basis of belief propagation.
Journal of Physics A: Mathematical and General, 36(43):11111ā11121.
- Kabashima etĀ al., (2009)
Kabashima, Y., Wadayama, T., and Tanaka, T. (2009).
A typical reconstruction limit for compressed sensing based on -norm minimization.
Journal of Statistical Mechanics: Theory and Experiment, 2009(09):L09003.
- Kim etĀ al., (2016)
Kim, C., Pilania, G., and Ramprasad, R. (2016).
From organized high-throughput data to phenomenological theory using machine learning: The example of dielectric breakdown.
Chemistry of Materials, 28(5):1304ā1311.
- Laurent and Massart, (2000)
Laurent, B. and Massart, P. (2000).
Adaptive estimation of a quadratic functional by model selection.
The Annals of Statistics, 28(5):1302 ā 1338.
- Meinshausen and Bühlmann, (2006)
Meinshausen, N. and Bühlmann, P. (2006).
High-dimensional graphs and variable selection with the Lasso.
The Annals of Statistics, 34(3):1436 ā 1462.
- (20)
Meng, X., Obuchi, T., and Kabashima, Y. (2021a).
Ising model selection using -regularized linear regression: A statistical mechanics analysis.
In Advances in Neural Information Processing Systems, volumeĀ 34, pages 6290ā6303.
- (21)
Meng, X., Obuchi, T., and Kabashima, Y. (2021b).
Structure learning in inverse Ising problems using -regularized linear estimator.
Journal of Statistical Mechanics: Theory and Experiment, 2021(5):053403.
- MƩzard and Montanari, (2009)
MƩzard, M. and Montanari, A. (2009).
Information, Physics, and Computation.
Oxford University Press, Inc., USA.
- Mézard et al., (1986)
MƩzard, M., Parisi, G., and Virasoro, M. (1986).
Spin Glass Theory and Beyond.
WORLD SCIENTIFIC.
- Osborne etĀ al., (2000)
Osborne, M.Ā R., Presnell, B., and Turlach, B.Ā A. (2000).
On the lasso and its dual.
Journal of Computational and Graphical Statistics, 9(2):319ā337.
- Pilania etĀ al., (2016)
Pilania, G., Mannodi-Kanakkithodi, A., Uberuaga, B.Ā P., Ramprasad, R., Gubernatis, J.Ā E., and Lookman, T. (2016).
Machine learning bandgaps of double perovskites.
Scientific Reports, 6(1):19375.
- Rush and Venkataramanan, (2018)
Rush, C. and Venkataramanan, R. (2018).
Finite sample analysis of approximate message passing algorithms.
IEEE Transactions on Information Theory, 64(11):7264ā7286.
- Stojnic, (2013)
Stojnic, M. (2013).
A framework to characterize performance of lasso algorithms.
arXiv, https://arxiv.org/abs/1303.7291.
- Su etĀ al., (2017)
Su, W., Bogdan, M., and CandĆØs, E. (2017).
False discoveries occur early on the lasso path.
The Annals of Statistics, 45(5):2133ā2150.
- Takahashi and Kabashima, (2022)
Takahashi, T. and Kabashima, Y. (2022).
Macroscopic analysis of vector approximate message passing in a model-mismatched setting.
IEEE Transactions on Information Theory, 68(8):5579–5600.
- Thrampoulidis et al., (2018)
Thrampoulidis, C., Abbasi, E., and Hassibi, B. (2018).
Precise error analysis of regularized $M$-estimators in high dimensions.
IEEE Transactions on Information Theory, 64(8):5592–5628.
- Tibshirani, (1996)
Tibshirani, R. (1996).
Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.
- Tibshirani and Taylor, (2012)
Tibshirani, R. J. and Taylor, J. (2012).
Degrees of freedom in lasso problems.
The Annals of Statistics, 40(2):1198–1232.
- Vehkaperä et al., (2016)
Vehkaperä, M., Kabashima, Y., and Chatterjee, S. (2016).
Analysis of regularized LS reconstruction and random matrix ensembles in compressed sensing.
IEEE Transactions on Information Theory, 62(4):2100–2124.
- Wainwright, (2009a)
Wainwright, M. J. (2009a).
Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting.
IEEE Transactions on Information Theory, 55(12):5728–5741.
- Wainwright, (2009b)
Wainwright, M. J. (2009b).
Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (lasso).
IEEE Transactions on Information Theory, 55(5):2183–2202.
- Wang et al., (2020)
Wang, H., Yang, Y., Bu, Z., and Su, W. (2020).
The complete lasso tradeoff diagram.
In Advances in Neural Information Processing Systems, volume 33, pages 20051–20060. Curran Associates, Inc.
- Zdeborová and Krzakala, (2016)
Zdeborová, L. and Krzakala, F. (2016).
Statistical physics of inference: thresholds and algorithms.
Advances in Physics, 65(5):453–552.
- Zhang and Huang, (2008)
Zhang, C.-H. and Huang, J. (2008).
The sparsity and bias of the Lasso selection in high-dimensional linear regression.
The Annals of Statistics, 36(4):1567–1594.
- Zhao and Yu, (2006)
Zhao, P. and Yu, B. (2006).
On model selection consistency of lasso.
Journal of Machine Learning Research, 7(90):2541–2563.
Supplementary Materials
Appendix A Detailed derivation of Claim 1
Here, we derive the expression in Claim 1; see Figure 4 for an outline of the calculation. For simplicity, we abbreviate , the average over excluding the submatrix acting on , as , and as the submatrix of excluding . Using the shorthand expression and , can be written as
[TABLE]
Using the Fourier representation, the average of the delta functions over is given by
[TABLE]
where we defined the matrix as and used the notation without confusion. This implies that the vector is Gaussian with covariance matrix . Now, the replica symmetric ansatz (9) implies that the integral over is dominated by the subspace of the form
[TABLE]
which allows us to simplify the profile of as
[TABLE]
where and are all i.i.d. standard Gaussian variables. Using (43)–(45) yields the expression
[TABLE]
where is given by
[TABLE]
The integral with respect to can be evaluated using Laplace's method for large , yielding
[TABLE]
where the subleading terms are ignored. Similarly, is given by
[TABLE]
Therefore, ignoring the subleading term with respect to ,
[TABLE]
The log of the integral with respect to can be expanded as
[TABLE]
where Laplace's approximation was used for large to obtain the third line. Substituting (47), (A) and (49) into (46), using the saddle point method for large results in
[TABLE]
Noticing that
[TABLE]
and finally rescaling and , one obtains (14).
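As a numerical aside, the Laplace (saddle point) approximation used repeatedly in this derivation can be checked on a toy integral. The sketch below is not part of the paper's calculation: the exponent `g(x) = cosh(x)`, the integration range, and the function names are all illustrative assumptions. It compares the normalized log-integral $(1/n)\log\int e^{-n g(x)}\,dx$ with the leading Laplace estimate $-g(x^*) + \frac{1}{2n}\log\frac{2\pi}{n\,g''(x^*)}$, whose second term is the kind of subleading contribution that is dropped for large $n$.

```python
import math


def g(x):
    # Toy exponent (an assumption for illustration): unique minimum g(0) = 1, g''(0) = 1.
    return math.cosh(x)


def log_integral_per_n(n, lo=-5.0, hi=5.0, steps=20000):
    """(1/n) * log of the integral of exp(-n g(x)) over [lo, hi], via the trapezoid rule."""
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(steps + 1)]
    g_min = min(g(x) for x in xs)
    # Factor out exp(-n * g_min) so the integrand stays within floating-point range.
    vals = [math.exp(-n * (g(x) - g_min)) for x in xs]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return -g_min + math.log(integral) / n


def laplace_estimate(n, g_star=1.0, curvature=1.0):
    """Leading Laplace estimate, including the Gaussian fluctuation term."""
    return -g_star + 0.5 * math.log(2.0 * math.pi / (n * curvature)) / n


for n in (10, 50, 400):
    print(n, log_integral_per_n(n), laplace_estimate(n))
```

As `n` grows, both quantities approach `-g(0) = -1`, which is the sense in which only the value at the saddle point survives at leading order.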
Appendix B Proof of auxiliary lemmas
B.1 Proof of Lemma 4
From (16) and (2.1), we have $\chi = f\left(\frac{\tilde{N}}{M-\bar{d}}\,\mathrm{erfc}\left(\sqrt{\frac{\Lambda}{2\hat{\chi}}}\right)\right)$, where , and $\bar{d} := \int D\bm{z}\,\left\|\hat{\bm{x}}_{(1+\chi)\lambda}(\sqrt{Q}\bm{z}+\bm{y})\right\|_{0}$. From , we see that is an increasing function. By using the Markov inequality, it can be deduced that
[TABLE]
which proves the first part of the lemma with . For the probability bound on , using ,
[TABLE]
where
[TABLE]
Using for , we have for large enough ,
[TABLE]
Since both and are nonnegative and increasing, for large enough ,
[TABLE]
The Markov inequality then implies the second part of the lemma with :
[TABLE]
B.2 Proof of Lemma 5
Equations (16) and (2.1) imply that holds for any . Thus, is deterministically upper-bounded as
[TABLE]
Now, satisfies (Chang et al., 2011) for ,
[TABLE]
Applying this inequality to (52) with and for yields
[TABLE]
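The exact inequality taken from Chang et al. (2011) is not reproduced above. As a hedged illustration of what a Chernoff-type bound on the complementary error function looks like, the sketch below checks the classical bound $\mathrm{erfc}(x) \le e^{-x^2}$ for $x \ge 0$ on a grid. This may differ from the form actually used in the proof, and the function names and grid are assumptions made only for this example.

```python
import math


def chernoff_upper_bound(x):
    # Classical Chernoff-type bound: erfc(x) <= exp(-x^2) for x >= 0.
    return math.exp(-x * x)


def max_ratio(xs):
    """Largest value of erfc(x) / exp(-x^2) over the supplied points."""
    return max(math.erfc(x) / chernoff_upper_bound(x) for x in xs)


grid = [0.01 * i for i in range(601)]  # x in [0, 6]
print(max_ratio(grid))
```

A ratio that never exceeds one on the grid is consistent with the bound, and the ratio is largest at `x = 0`, where the bound is tight.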
Appendix C Additional numerical experiments: Necessary condition for perfect support recovery
To verify the necessary sample complexity for perfect recovery given by Claim 5, numerical experiments were conducted. The profile of is the same as that of Section 3.1, and the regularization parameter is taken as . Figure 5 shows the perfect support recovery probability for noise distributed according to the Gaussian, uniform, and Laplace distributions. Clearly, for all three cases, perfect recovery fails with finite probability as tends to infinity when is less than the value indicated by Claim 5.
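An experiment of this kind can be sketched as follows. This is not the authors' code. The objective normalization $(1/2M)\|\bm{y}-A\bm{x}\|_2^2+\lambda\|\bm{x}\|_1$, the standard Gaussian entries of $A$, the Gaussian noise, and every numerical value (`signal`, `noise_std`, `lam`, the problem sizes) are assumptions chosen for illustration and need not match the paper's setup. The sketch solves Lasso by plain coordinate descent and records how often the estimated support equals the true one.

```python
import random


def soft_threshold(v, t):
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_cd(cols, y, lam, sweeps=50):
    """Coordinate descent for (1/(2M))||y - A x||^2 + lam*||x||_1; cols[j] is column j of A."""
    m, n = len(y), len(cols)
    x = [0.0] * n
    r = list(y)  # residual y - A x
    sq = [sum(a * a for a in c) / m for c in cols]
    for _ in range(sweeps):
        for j, c in enumerate(cols):
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(a * b for a, b in zip(c, r)) / m + sq[j] * x[j]
            new = soft_threshold(rho, lam) / sq[j]
            if new != x[j]:
                delta = new - x[j]
                for i in range(m):
                    r[i] -= delta * c[i]
                x[j] = new
    return x


def perfect_recovery(m, n, d, signal, noise_std, lam, rng):
    """True if the Lasso support equals the true support {0, ..., d-1}."""
    cols = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
    y = [sum(cols[j][i] * signal for j in range(d)) + rng.gauss(0.0, noise_std)
         for i in range(m)]
    x_hat = lasso_cd(cols, y, lam)
    return all((x_hat[j] != 0.0) == (j < d) for j in range(n))


def recovery_rate(m, n=50, d=2, signal=5.0, noise_std=0.1, lam=0.5, trials=3, seed=0):
    rng = random.Random(seed)
    hits = sum(perfect_recovery(m, n, d, signal, noise_std, lam, rng) for _ in range(trials))
    return hits / trials


for m in (5, 100):
    print(m, recovery_rate(m))
```

Sweeping `m` at fixed `n` and repeating over many more trials would give an empirical recovery curve of the type compared against the predicted threshold. The tiny trial count here only keeps the sketch fast.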
