Average case analysis of Lasso under ultra-sparse conditions

Koki Okajima; Xiangming Meng; Takashi Takahashi; Yoshiyuki Kabashima

arXiv:2302.13093·cond-mat.dis-nn·February 28, 2023

Open Access

TL;DR

This paper provides an average-case analysis of Lasso in ultra-sparse linear models using a novel replica method approach, offering insights into support recovery and performance without restrictive assumptions.

Contribution

It introduces a new analytical framework for Lasso's performance in ultra-sparse regimes, extending previous results to more general settings and noise conditions.

Findings

01 Provides a lower bound on sample complexity for support recovery

02 Generalizes previous bounds to non-Gaussian noise

03 Supports analysis with extensive numerical experiments

Abstract

We analyze the performance of the least absolute shrinkage and selection operator (Lasso) for the linear model when the number of regressors $N$ grows larger keeping the true support size $d$ finite, i.e., the ultra-sparse case. The result is based on a novel treatment of the non-rigorous replica method in statistical physics, which has been applied only to problem settings where $N$, $d$ and the number of observations $M$ tend to infinity at the same rate. Our analysis makes it possible to assess the average performance of Lasso with Gaussian sensing matrices without assumptions on the scaling of $N$ and $M$, the noise distribution, and the profile of the true signal. Under mild conditions on the noise distribution, the analysis also offers a lower bound on the sample complexity necessary for partial and perfect support recovery when $M$ diverges as $M=O(\log N)$. The obtained bound…

Table 1: Values of $\alpha_{C}$ evaluated from Figure 2.

$\bm{x}_{S}^{0}$	$(\lambda, \sigma^{2})$	$\alpha_{C}$
$[1, 1, 1]$	$(0.5, 0.0)$	6.00
$[1, 1, 1]$	$(0.5, 0.5)$	10.0
$[1 / 3, 2 / 3, 1]$	$(0.5, 0.0)$	4.89
$[1 / 3, 2 / 3, 1]$	$(0.5, 0.5)$	8.89

Equations

$$\bm{y}=\mathbf{A}\bm{x}^{0}+\bm{\xi}, \tag{1}$$

$$\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}):=\operatorname*{arg\,min}_{\bm{x}}\quantity(\frac{1}{2}\norm{\mathbf{A}\bm{x}-\bm{y}}^{2}+M\lambda\norm{\bm{x}}_{1}), \tag{2}$$

$$\mathbb{E}_{\mathbf{A},\bm{y}}(\cdots)=\int\differential\bm{y}\,\differential\mathbf{A}\,\differential\bm{\xi}\ p_{\xi}(\bm{\xi})\,(\cdots)\,\frac{e^{-\frac{1}{2}{\rm Tr}\,\mathbf{A}^{\mathsf{T}}\mathbf{A}}}{(2\pi)^{NM/2}}\,\delta(\bm{y}-\mathbf{A}\bm{x}^{0}-\bm{\xi}), \tag{3}$$

${\rm FP}(\mathbf{A},\bm{y})$

${\rm TP}(\mathbf{A},\bm{y})$

$$P_{\beta}(\bm{x}\mid\mathbf{A},\bm{y}):=Z_{\beta}^{-1}\quantity(\mathbf{A},\bm{y})\exp\quantity(-\frac{\beta}{2}\norm{\mathbf{A}\bm{x}-\bm{y}}^{2}-\beta M\lambda\norm{\bm{x}}_{1}), \tag{5}$$

$$\mathcal{F}=-\lim_{\beta\to\infty}\beta^{-1}\mathbb{E}_{\mathbf{A},\bm{y}}\log Z_{\beta}(\mathbf{A},\bm{y}). \tag{6}$$

$$\mathbb{E}_{\mathbf{A},\bm{y}}\log Z_{\beta}(\mathbf{A},\bm{y})=\lim_{n\to+0}\frac{\mathbb{E}_{\mathbf{A},\bm{y}}Z_{\beta}^{n}(\mathbf{A},\bm{y})-1}{n}. \tag{7}$$

$$\mathbb{E}_{\mathbf{A},\bm{y}}Z_{\beta}^{n}(\mathbf{A},\bm{y})=\mathbb{E}_{\mathbf{A},\bm{y}}\int\prod_{a=1}^{n}\differential\bm{x}^{a}\exp\quantity(-\frac{\beta}{2}\sum_{a=1}^{n}\norm{\mathbf{A}\bm{x}^{a}-\bm{y}}^{2}-\beta M\lambda\sum_{a=1}^{n}\norm{\bm{x}^{a}}_{1}). \tag{8}$$

$$\sum_{i\notin S}\Delta_{i}^{a}\Delta_{i}^{b}:=\begin{cases}Q&a=b\\ Q-\chi/\beta&\text{otherwise},\end{cases} \tag{9}$$

$$\mathbb{E}_{\mathbf{A}_{S},\bm{y}}\int\prod_{a=1}^{n}\differential\bm{\Delta}^{a}\int\differential Q\,\differential\chi\ e^{-\beta M\lambda\sum_{a=1}^{n}\norm{\bm{\Delta}^{a}}_{1}}\ \mathcal{I}\mathcal{L}, \tag{10}$$

$$\mathcal{I}:=\prod_{a=1}^{n}\delta\quantity(Q-\sum_{i\notin S}(\Delta_{i}^{a})^{2})\prod_{a<b}\delta\quantity(Q-\frac{\chi}{\beta}-\sum_{i\notin S}\Delta_{i}^{a}\Delta_{i}^{b}), \tag{11}$$

$$\mathcal{L}:=\int D\bm{z}\quantity(\int\differential\bm{x}_{S}\ e^{-M\beta G(\bm{x}_{S};\bm{z})})^{n},\qquad G(\bm{x}_{S};\bm{z}):=\frac{\norm{\mathbf{A}_{S}\bm{x}_{S}+\sqrt{Q}\bm{z}-\bm{y}}^{2}}{2M(1+\chi)}+\lambda\norm{\bm{x}_{S}}_{1}. \tag{12}$$

$$\mathcal{I}=\int_{-i\infty}^{+i\infty}\differential\hat{Q}\,\differential\hat{\chi}\ e^{\frac{Mn\beta}{2}\quantity(Q\hat{Q}+(n-1)\chi\hat{\chi}-n\beta Q\hat{\chi})}\times\int D\bm{z}\ e^{\sum_{a=1}^{n}\quantity(-\frac{M\beta\hat{Q}}{2}\norm{\bm{\Delta}^{a}}^{2}+\beta\sqrt{M\hat{\chi}}\,\bm{z}^{\mathsf{T}}\bm{\Delta}^{a})+o(\beta)}. \tag{13}$$

$$\begin{gathered}\mathcal{F}=\mathbb{E}_{\mathbf{A}_{S},\bm{y}}\ \operatorname*{Extr}_{\Theta}\Bigg\{-\frac{Q\hat{Q}-\chi\hat{\chi}}{2}\\ -\frac{\tilde{N}}{2\hat{Q}}\quantity[(\Lambda+\hat{\chi}){\rm erfc}\quantity(\sqrt{\frac{\Lambda}{2\hat{\chi}}})-\sqrt{\frac{2\Lambda\hat{\chi}}{\pi}}e^{-\Lambda/2\hat{\chi}}]\\ +\int D\bm{z}\min_{\bm{x}_{S}}G(\bm{x}_{S};\bm{z})\Bigg\}.\end{gathered} \tag{14}$$

$Q$

$\chi$

$\hat{Q}$

$=\frac{M-\int D\bm{z}\,\norm{\hat{\bm{x}}_{(1+\chi)\lambda}(\mathbf{A}_{S},\sqrt{Q}\bm{z}+\bm{y})}_{0}}{1+\chi},$

$\hat{\chi}$

$$(\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}))_{i\notin S}\sim g_{\lambda}(\hat{Q},\sqrt{\hat{\chi}}z_{i})=\operatorname*{arg\,min}_{x}\quantity(\frac{\hat{Q}}{2}x^{2}-\sqrt{\hat{\chi}}z_{i}x+M\lambda\absolutevalue{x}), \tag{19}$$

$$\left\langle\Psi(\bm{x})\right\rangle:=\lim_{\beta\to\infty}\mathbb{E}_{\mathbf{A},\bm{y}}\int\differential\bm{x}\ P_{\beta}(\bm{x}\mid\mathbf{A},\bm{y})\Psi(\bm{x})=-\lim_{\beta\to\infty}\lim_{h\to 0}\partialderivative{h}\beta^{-1}\mathbb{E}_{\mathbf{A},\bm{y}}\log Z_{\beta}(\mathbf{A},\bm{y};h\Psi), \tag{20}$$

$$Z_{\beta}(\mathbf{A},\bm{y};h\Psi):=\int\differential\bm{x}\ e^{-\frac{\beta}{2}\norm{\mathbf{A}\bm{x}-\bm{y}}^{2}-\beta M\lambda\norm{\bm{x}}_{1}-\beta h\Psi(\bm{x})}. \tag{21}$$

$$\left\langle\sum_{i\notin S}\psi(x_{i})\right\rangle=\tilde{N}\,\mathbb{E}_{\mathbf{A}_{S},\bm{y}}\int Dz\ \psi\quantity(g_{\lambda}(\hat{Q},\sqrt{\hat{\chi}}z)), \tag{22}$$

$$\left\langle\Psi(\bm{x}_{S})\right\rangle=\mathbb{E}_{\rm eff}\Psi\quantity(\hat{\bm{x}}_{(1+\chi)\lambda}(\mathbf{A}_{S},\sqrt{Q}\bm{z}+\bm{y})), \tag{23}$$

$\left\langle{\rm TP}\right\rangle$

$\left\langle{\rm FP}\right\rangle$

$\left\langle\epsilon_{x}\right\rangle$

$$s_{\lambda}^{(M)}:=\frac{1}{M}\norm{\bm{\gamma}_{\lambda}(\bm{y})-\bm{y}}^{2}$$

$$\Gamma^{(M)}:=\frac{1}{M}\int\differential\bm{\xi}\ p_{\xi}(\bm{\xi})\norm{\bm{\xi}}^{2}<C$$

$$\alpha(1+\epsilon)>\alpha_{C}=\frac{\bar{s}_{\lambda}}{2\lambda^{2}}, \tag{27}$$

$$\alpha(1+\epsilon)>2\quantity(d+\frac{\Gamma}{\lambda^{2}}), \tag{28}$$

$$\Pr\quantity[\lambda_{\max}(\mathbf{B})\geq\quantity(\sqrt{M}+\sqrt{d}+t)^{2}]\leq e^{-\frac{t^{2}}{2}}$$

Taxonomy

Topics: Statistical Methods and Inference · Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models

Full text

Average case analysis of Lasso under ultra-sparse conditions

Koki Okajima

Xiangming Meng

Takashi Takahashi

Yoshiyuki Kabashima

Department of Physics, The University of Tokyo

Abstract

We analyze the performance of the least absolute shrinkage and selection operator (Lasso) for the linear model when the number of regressors $N$ grows larger keeping the true support size $d$ finite, i.e., the ultra-sparse case. The result is based on a novel treatment of the non-rigorous replica method in statistical physics, which has been applied only to problem settings where $N$ , $d$ and the number of observations $M$ tend to infinity at the same rate.

Our analysis makes it possible to assess the average performance of Lasso with Gaussian sensing matrices without assumptions on the scaling of $N$ and $M$ , the noise distribution, and the profile of the true signal. Under mild conditions on the noise distribution, the analysis also offers a lower bound on the sample complexity necessary for partial and perfect support recovery when $M$ diverges as $M=O(\log N)$ . The obtained bound for perfect support recovery is a generalization of that given in previous literature, which only considers the case of Gaussian noise and diverging $d$ . Extensive numerical experiments strongly support our analysis.

1 Introduction

An important objective of high-dimensional statistics is to extract information in situations where the signal’s dimension $N$ is overwhelmingly large compared to the sample size $M$. It is crucial to incorporate prior knowledge on the signal structure to reduce the signal space dimension for reliable estimation. A particularly common assumption is sparsity, which postulates that the true signal has few nonzero entries. Exploiting this property allows one to obtain robust and interpretable results specifying the few relevant variables explaining the retrieved data (Donoho, 2006).

For instance, consider the sparse linear regression problem where measurements $\bm{y}\in\mathbb{R}^{M}$ of the signal $\bm{x}^{0}\in\mathbb{R}^{N}$ with $d$ non-zero components are given by the linear model

$$\bm{y}=\mathbf{A}\bm{x}^{0}+\bm{\xi}, \tag{1}$$

where $\mathbf{A}\in\mathbb{R}^{M\times N}$ is the sensing matrix, and $\bm{\xi}\in\mathbb{R}^{M}$ is the noise vector distributed according to $p_{\xi}(\bm{\xi})$. The most fundamental yet popular sparse signal estimation method is the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996), which obtains the estimator by solving the following convex program:

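$$\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}):=\operatorname*{arg\,min}_{\bm{x}}\quantity(\frac{1}{2}\norm{\mathbf{A}\bm{x}-\bm{y}}^{2}+M\lambda\norm{\bm{x}}_{1}), \tag{2}$$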

where $\lambda$ is a regularization parameter. Since its introduction, this simple $\ell_{1}$-regularization scheme has been successfully adopted as a backbone technique for solving a wide variety of sparse estimation problems. A particularly interesting question is whether one can make any guarantees on the performance of Lasso under general scalings of $(N,M,d)$, and how these guarantees depend on $\lambda$ and on the statistical properties of the noise and true signal.
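To make the convex program concrete, the following is a minimal sketch of a coordinate descent solver for the objective $\frac{1}{2}\norm{\mathbf{A}\bm{x}-\bm{y}}^{2}+M\lambda\norm{\bm{x}}_{1}$. It is an illustration only and not the solver used in the paper; the function names `soft_threshold` and `lasso_cd`, the sweep count and the tolerance are assumptions introduced here.

```python
import numpy as np

def soft_threshold(u, t):
    """Soft-thresholding operator: sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_cd(A, y, lam, n_sweeps=500, tol=1e-10):
    """Coordinate descent for min_x 0.5*||A x - y||^2 + M*lam*||x||_1,
    where M is the number of rows of A (hypothetical helper)."""
    M, N = A.shape
    x = np.zeros(N)
    r = np.array(y, dtype=float)        # residual y - A x, kept up to date
    col_sq = np.sum(A * A, axis=0)      # squared column norms
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(N):
            if col_sq[j] == 0.0:
                continue
            old = x[j]
            # Correlation of column j with the residual that excludes x_j.
            rho = A[:, j] @ r + col_sq[j] * old
            new = soft_threshold(rho, M * lam) / col_sq[j]
            if new != old:
                r -= A[:, j] * (new - old)
                x[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return x

# Example: d = 3 nonzero entries, standard Gaussian design, no noise.
rng = np.random.default_rng(0)
M, N = 40, 100
A = rng.standard_normal((M, N))
x0 = np.zeros(N)
x0[:3] = 1.0
y = A @ x0
x_hat = lasso_cd(A, y, lam=0.5)
print("estimated support:", np.flatnonzero(x_hat))
```

A solution can be checked through the optimality conditions: $\mathbf{A}^{\mathsf{T}}(\bm{y}-\mathbf{A}\hat{\bm{x}})$ equals $M\lambda\,{\rm sign}(\hat{x}_{i})$ on nonzero coordinates and has magnitude at most $M\lambda$ elsewhere.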

A large body of research has been devoted to assessing the performance of Lasso. Traditionally, research based on the irrepresentability condition (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006) has been popular in establishing guarantees in terms of support recovery of the sparse signal (Wainwright, 2009b; Dossal et al., 2012; Meinshausen and Bühlmann, 2006; Zhang and Huang, 2008; Candès and Plan, 2009; Zhao and Yu, 2006). A different approach, based on approximate message-passing (AMP) theory (Donoho et al., 2009) and the heuristic replica method (Mézard et al., 1986) from statistical physics, has focused on assessing the sharp, asymptotic properties of Lasso in the large $N$ and $M$ limit under random sensing matrix designs. Despite these works, the understanding of the Lasso estimator is still limited. Analysis based on the irrepresentability condition often offers only scaling guarantees with respect to $(N,M,d)$, or statements with strong assumptions on the regularization parameter. Moreover, the AMP/replica-based analysis has been limited to linear sparsity, i.e. $d/N=O(1)$ and $M/N=O(1)$ as $N\to\infty$, which may be somewhat unrealistic compared to real-world situations.

1.1 Contributions

In this work, we address the drawbacks of both the irrepresentability condition approach and the AMP/replica approach by theoretically analyzing the average performance of Lasso when $d=O(1)$, i.e. the ultra-sparse case (Donoho et al., 1992; Bhadra et al., 2017), which is a more typical situation in certain applications such as materials informatics (Ghiringhelli et al., 2015; Kim et al., 2016; Pilania et al., 2016). Moreover, our result offers a necessary condition for support recovery in the limit $N,M\to\infty$. Specifically, our contributions are summarized as follows:

• We provide a new way to apply the replica method in the ultra-sparsity regime. This is done by explicitly handling the correlations and finite-size effects acting on the active set ${\rm supp}(\bm{x}^{0})=\quantity{i\ |\ x_{i}^{0}\neq 0,\ i=1,\cdots,N}$, which are otherwise ignored in conventional analysis (Section 2.1, Claim 1).

• Using this enhanced replica method, we precisely evaluate the average properties of Lasso under ultra-sparsity and standard Gaussian matrix design, i.e. each element of $\mathbf{A}$ is i.i.d. according to a standard Gaussian distribution. This provides an extension to previous results derived from the AMP theory and the replica method, where linear sparsity is necessary for the analysis (Section 2.2, Claim 2).

• We derive a necessary condition for partial support recovery ${\rm supp}(\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}))\subseteq{\rm supp}(\bm{x}^{0})$ under some mild conditions (Assumption 1). Specifically, the number of false positives, and subsequently the model misselection probability, vanishes only if $M>\alpha_{C}\log N$ for $N\to\infty$. This constant $\alpha_{C}$ is determined by the mean prediction error of an oracle (Section 2.3, Claims 3 and 4).

• In addition to partial support recovery, the analysis also provides a necessary condition for perfect support recovery ${\rm supp}(\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}))={\rm supp}(\bm{x}^{0})$, which generalizes the sample complexity bound given by Wainwright (2009b) for i.i.d. Gaussian noise distributions in the limit $d\to\infty$ to more general noise distributions under constant $d$ (Section 2.3, Claim 5).

• We demonstrate that our theory agrees well with experiment by conducting extensive numerical simulations (Section 3).

Note that all of the results are derived from the enhanced replica method, which is yet to be proven rigorously; hence the statements are presented as claims.

1.2 Related Work

Irrepresentability Condition.

As mentioned above, the irrepresentability condition, first introduced by Meinshausen and Bühlmann (2006) and Zhao and Yu (2006), has been an important cornerstone, as it establishes a sufficient condition for perfect support recovery. This condition indicates whether the covariates, i.e. the columns of $\mathbf{A}$, are linearly independent enough to be distinguishable from one another, so that variable selection is feasible. It has been revealed that Lasso is an “optimal” support estimator in the sublinear regime $d=o(N)$, i.e. Lasso has its success/failure threshold for sample complexity in the same order as the information-theoretic one (Fletcher et al., 2009; Wainwright, 2009a). However, little is known about the constants involved in these conditions. Wainwright (2009b) provided necessary and sufficient conditions for perfect support recovery under random Gaussian matrices for diverging $d$. This is a simple and explicit bound which depends on the regularization parameter and the intensity of the noise, which is restricted to i.i.d. Gaussian. Focusing on the case $d=O(1)$, Dossal et al. (2012) derived sufficient conditions for partial and perfect support recovery under deterministic noise, whose bound is similar to the one given in Wainwright (2009b).

AMP theory.

A particular line of work has aimed at assessing the properties of Lasso under general random matrix designs via careful analysis of the dynamical behavior of the AMP algorithm (Kabashima, 2003; Donoho et al., 2009; Takahashi and Kabashima, 2022), whose convergence point coincides with (2) in the large $N$ limit. Rather than establishing inequality bounds or conditions, the objective is to establish sharp results on the Lasso for a random instance of $(\mathbf{A},\bm{y})$. Although the analysis is limited to the linear sparsity regime, powerful and precise results have been proven rigorously under this framework (Bayati and Montanari, 2012). For instance, Su et al. (2017) and Wang et al. (2020) determine the possible rate of false positives and true positives achievable under certain settings, which can be obtained by solving a small set of nonlinear equations. Nevertheless, the analysis does not give insight on support recovery, since this is impossible in the linear sparsity regime (Fletcher et al., 2009; Wainwright, 2009a).

Replica method.

Results similar to those from AMP theory have also been derived by using the non-rigorous replica method in statistical mechanics. Unlike AMP theory, which is based on a convergence analysis of a particular algorithm, the replica method aims at directly calculating the average over $(\mathbf{A},\bm{y})$ of a cumulant generating function for some probability distribution, i.e. of the form $K_{\phi}(t)=\mathbb{E}_{\mathbf{A},\bm{y}}\ \log\int\differential\bm{x}\ e^{t\phi(\bm{x})}p(\bm{x}|\mathbf{A},\bm{y})$. This calculation is often encountered in the field of statistics, where one is interested in the average behavior of a statistical model. While lacking a complete proof, this method has been successful in predicting the average performance of machine learning and optimization methods under general random designs in the linear sparsity regime (Vehkaperä et al., 2016; Zdeborová and Krzakala, 2016). In fact, under certain assumptions, the average predictions given by the replica method have been proven to be consistent with the asymptotic results obtained from AMP theory and other rigorous methods (Stojnic, 2013; Thrampoulidis et al., 2018). Similar to AMP theory, however, reliable adaptations of this method outside linear sparsity are still open problems. Previous research such as Abbara et al. (2020), Meng et al. (2021a) and Meng et al. (2021b) analyzed the performance of sparse Ising model selection using a variation of the replica method. However, this was accomplished through a series of ansatzes which are generally difficult to justify theoretically.

1.3 Preliminaries

Here we summarize the notations used in this paper. The expression $\norm{\cdot}$ denotes the $\ell_{2}$ norm. The active set $S$ is defined as the support of the $d$-sparse true signal $\bm{x}^{0}$, $S:={\rm supp}(\bm{x}^{0})=\{i\ |\ x^{0}_{i}\neq 0,\ i=1,\cdots,N\}$. Define $\tilde{N}:=N-d$, the size of the inactive set. The matrix $\mathbf{A}_{S}$ denotes the submatrix constructed by concatenating the columns of $\mathbf{A}$ with indices in $S$. The vector $\bm{x}^{0}_{S}$ denotes the subvector of $\bm{x}^{0}$ with indices in $S$. For simplicity, $\bm{x}^{0}$ is assumed to be deterministic, although this can be extended to random signals trivially. The expression $\mathbb{E}_{\mathbf{A},\bm{y}}$ denotes the average over the joint probability with respect to the pair $(\mathbf{A},\bm{y})$, i.e.

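$$\mathbb{E}_{\mathbf{A},\bm{y}}(\cdots)=\int\differential\bm{y}\,\differential\mathbf{A}\,\differential\bm{\xi}\ p_{\xi}(\bm{\xi})\,(\cdots)\,\frac{e^{-\frac{1}{2}{\rm Tr}\,\mathbf{A}^{\mathsf{T}}\mathbf{A}}}{(2\pi)^{NM/2}}\,\delta(\bm{y}-\mathbf{A}\bm{x}^{0}-\bm{\xi}), \tag{3}$$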

where $\delta(\cdot)$ denotes the Dirac delta function. The definition of $\mathbb{E}_{\mathbf{A}_{S},\bm{y}}$ follows straightforwardly from the above. Also, define $D\bm{z}$ as the standard Gaussian measure, $D\bm{z}=\differential\bm{z}e^{-\norm{\bm{z}}^{2}/2}/(2\pi)^{n/2}$ for $\bm{z}\in\mathbb{R}^{n}$ . Given $(\mathbf{A},\bm{x}^{0},\bm{y})$ and regularization parameter $\lambda$ , the oracle Lasso estimator is defined as $\hat{\bm{x}}_{\lambda}(\mathbf{A}_{S},\bm{y})$ , which is the Lasso estimator with the true support identified beforehand. It is also convenient to define the oracle Lasso fit, defined by $\bm{\gamma}_{\lambda}(\bm{y}):=\mathbf{A}_{S}\hat{\bm{x}}_{\lambda}(\mathbf{A}_{S},\bm{y})$ , with its dependence on $\mathbf{A}_{S}$ suppressed for convenience.

Given a configuration $(\mathbf{A},\bm{x}^{0},\bm{y})$ and regularization parameter $\lambda$, the number of false positives $\rm FP$ and the number of true positives $\rm TP$ of the Lasso estimator are defined as

[TABLE]

where $S^{C}$ denotes the complement of the set $S$ in $\quantity{1,\cdots,N}$. When no confusion arises, the dependence on $(\mathbf{A},\bm{y})$ is suppressed for convenience.
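As a small illustration, the two counts can be computed from an estimate and the true support as below. This sketch assumes the usual reading of the definitions, namely that FP counts nonzero estimated entries in $S^{C}$ and TP counts nonzero estimated entries in $S$; the helper name `count_tp_fp` is hypothetical.

```python
import numpy as np

def count_tp_fp(x_hat, support):
    """Return (TP, FP): nonzero entries of x_hat inside and outside `support`.

    Assumes FP = |supp(x_hat) ∩ S^C| and TP = |supp(x_hat) ∩ S|.
    """
    est = np.asarray(x_hat) != 0
    in_S = np.zeros(est.shape[0], dtype=bool)
    in_S[list(support)] = True
    tp = int(np.count_nonzero(est & in_S))
    fp = int(np.count_nonzero(est & ~in_S))
    return tp, fp

# Support S = {0, 1}; the estimate is nonzero at indices 0, 2 and 4.
tp, fp = count_tp_fp([0.5, 0.0, 0.2, 0.0, -0.1], {0, 1})
print(tp, fp)
```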

We say that an event $A$ holds with asymptotically high probability (w.a.h.p.) if there exists a constant $c>0$ such that ${\Pr}[A]>1-O(N^{-c})$ . We also say that $A$ holds with probability approaching one (w.p.a.1) if ${\Pr}[A]>1-o(1)$ as $N\to\infty$ .

2 Replica analysis

Define the Boltzmann distribution as

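$$P_{\beta}(\bm{x}\mid\mathbf{A},\bm{y}):=Z_{\beta}^{-1}\quantity(\mathbf{A},\bm{y})\exp\quantity(-\frac{\beta}{2}\norm{\mathbf{A}\bm{x}-\bm{y}}^{2}-\beta M\lambda\norm{\bm{x}}_{1}), \tag{5}$$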

where $Z_{\beta}(\mathbf{A},\bm{y})$ is the normalization constant. Note that in the limit $\beta\to\infty$, (5) converges to a point mass concentrated on the Lasso estimator $\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y})$. The main objective of our analysis is to calculate the average of the logarithm of $Z_{\beta}\quantity(\mathbf{A},\bm{y})$ over the random variables $(\mathbf{A},\bm{y})$ in the limit $\beta\to\infty$, which is called the free energy or the cumulant generating function

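$$\mathcal{F}=-\lim_{\beta\to\infty}\beta^{-1}\mathbb{E}_{\mathbf{A},\bm{y}}\log Z_{\beta}(\mathbf{A},\bm{y}). \tag{6}$$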

The properties of $\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y})$ averaged over the population of $(\mathbf{A},\bm{y})$ can then be assessed by taking appropriate derivatives of $\mathcal{F}$ .
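The concentration of the Boltzmann distribution on the Lasso estimator as $\beta\to\infty$ can be seen in a one-dimensional toy case. The sketch below assumes $M=1$ and $\mathbf{A}=1$, so the energy is $\frac{1}{2}(x-y)^{2}+\lambda\absolutevalue{x}$ and the Lasso estimator is the soft-thresholded observation; the function names and the integration grid are illustrative assumptions.

```python
import numpy as np

def boltzmann_mean(y, lam, beta, half_width=5.0, n_grid=200001):
    """Mean of p(x) ∝ exp(-beta * (0.5*(x - y)^2 + lam*|x|)) on a uniform grid."""
    grid = np.linspace(-half_width, half_width, n_grid)
    energy = 0.5 * (grid - y) ** 2 + lam * np.abs(grid)
    # Subtract the minimum energy before exponentiating to avoid underflow.
    w = np.exp(-beta * (energy - energy.min()))
    return float(np.sum(grid * w) / np.sum(w))

def scalar_lasso(y, lam):
    """Minimizer of 0.5*(x - y)^2 + lam*|x| (soft-thresholding)."""
    return float(np.sign(y) * max(abs(y) - lam, 0.0))

for beta in (1.0, 10.0, 1e2, 1e4):
    print(beta, boltzmann_mean(1.5, 0.5, beta), scalar_lasso(1.5, 0.5))
```

As $\beta$ grows, the printed Boltzmann mean approaches the soft-thresholded value in the last column.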

Although (6) is difficult to calculate straightforwardly, this can be resolved by using the replica method (Mézard and Montanari, 2009; Mézard et al., 1986), which is based on the following equality

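$$\mathbb{E}_{\mathbf{A},\bm{y}}\log Z_{\beta}(\mathbf{A},\bm{y})=\lim_{n\to+0}\frac{\mathbb{E}_{\mathbf{A},\bm{y}}Z_{\beta}^{n}(\mathbf{A},\bm{y})-1}{n}. \tag{7}$$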

Instead of handling the cumbersome $\log$ expression in (6) directly, one calculates the average of the $n$-th power of $Z_{\beta}$ for $n\in\mathbb{N}$, analytically continues this expression to $n\in\mathbb{R}$, and finally takes the limit $n\to+0$. Based on this replica “trick”, it suffices to calculate

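$$\mathbb{E}_{\mathbf{A},\bm{y}}Z_{\beta}^{n}(\mathbf{A},\bm{y})=\mathbb{E}_{\mathbf{A},\bm{y}}\int\prod_{a=1}^{n}\differential\bm{x}^{a}\exp\quantity(-\frac{\beta}{2}\sum_{a=1}^{n}\norm{\mathbf{A}\bm{x}^{a}-\bm{y}}^{2}-\beta M\lambda\sum_{a=1}^{n}\norm{\bm{x}^{a}}_{1}) \tag{8}$$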

up to the first order of $n$ to take the $n\to+0$ limit in the right hand side of (7).
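The identity behind the replica trick can be checked on a toy random variable for which every moment is known in closed form. The sketch below is not the Lasso partition function: it assumes a hypothetical log-normal $Z=e^{g}$ with $g\sim\mathcal{N}(\mu,\sigma^{2})$, for which $\mathbb{E}\log Z=\mu$ and $\mathbb{E}Z^{n}=\exp(n\mu+n^{2}\sigma^{2}/2)$ holds for real $n$.

```python
import math

def replica_estimate(mu, sigma, n):
    """(E[Z^n] - 1) / n for Z = exp(g), g ~ N(mu, sigma^2).

    Uses the closed form E[Z^n] = exp(n*mu + 0.5*n^2*sigma^2); expm1 keeps
    the subtraction accurate for small n.
    """
    return math.expm1(n * mu + 0.5 * (n * sigma) ** 2) / n

mu, sigma = 0.7, 1.3
for n in (1.0, 0.1, 0.01, 0.001, 1e-6):
    print(n, replica_estimate(mu, sigma, n))   # tends to E[log Z] = mu as n -> 0
```

Only the term linear in $n$ survives the limit, which is why the moment calculation is needed only up to first order in $n$.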

2.1 Outline of the derivation

Here, we only give a brief outline of the derivation; for details, see Supplementary Materials. Rewriting $\Delta^{a}_{i}:=x^{a}_{i}\ (i\notin S)$, it is convenient to introduce the auxiliary variable $h_{\mu}^{a}\equiv\sum_{i\notin S}A_{\mu i}\Delta_{i}^{a}\ (\mu=1,\cdots,M)$, which accounts for the effect from the variables not in the true support in each replica $a$. A crucial observation is that $\quantity{A_{\mu i}}_{1\leq\mu\leq M,\ i\notin S}$ is statistically independent from $(\mathbf{A}_{S},\bm{y})$, which allows the average to be taken individually.
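Because each $h_{\mu}^{a}$ is a linear combination of independent standard Gaussian entries, for fixed $\bm{\Delta}^{a},\bm{\Delta}^{b}$ its second moment across replicas is $\mathbb{E}h_{\mu}^{a}h_{\mu}^{b}=\sum_{i\notin S}\Delta_{i}^{a}\Delta_{i}^{b}$. The following Monte Carlo sketch checks this numerically; the vectors `delta_a`, `delta_b` and the sample size are arbitrary illustrative choices.

```python
import numpy as np

def empirical_h_covariance(delta_a, delta_b, n_rows, rng):
    """Estimate E[h^a h^b] with h^a = sum_i A_{mu i} delta_a[i], A i.i.d. N(0, 1).

    Each of the n_rows rows of A gives one independent sample of (h^a, h^b).
    """
    A = rng.standard_normal((n_rows, delta_a.size))
    h_a = A @ delta_a
    h_b = A @ delta_b
    return float(np.mean(h_a * h_b))

rng = np.random.default_rng(0)
delta_a = np.array([0.3, -0.2, 0.5, 0.0, 0.1])
delta_b = np.array([0.1, 0.4, 0.5, -0.3, 0.0])
print(empirical_h_covariance(delta_a, delta_b, 200_000, rng))  # Monte Carlo estimate
print(float(delta_a @ delta_b))                                # predicted covariance
```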

By taking the average over the Gaussian variables $\quantity{A_{\mu i}}_{1\leq\mu\leq M,\ i\notin S}$ first, we find that $h_{\mu}^{a}$ is Gaussian with zero mean and covariance $\mathbb{E}h_{\mu}^{a}h_{\nu}^{b}=\delta_{\mu\nu}\sum_{i\notin S}\Delta_{i}^{a}\Delta_{i}^{b}$. By assuming the replica symmetric (RS) ansatz (Mézard et al., 1986)

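$$\sum_{i\notin S}\Delta_{i}^{a}\Delta_{i}^{b}:=\begin{cases}Q&a=b\\ Q-\chi/\beta&\text{otherwise},\end{cases} \tag{9}$$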

the integral for the replicated vectors $\quantity{\bm{\Delta}^{a}}_{a=1}^{n}$ over the whole $\mathbb{R}^{\tilde{N}\times n}$ space is restricted to a subspace satisfying the constraints (9). More explicitly, one can rewrite (8) as

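$$\mathbb{E}_{\mathbf{A}_{S},\bm{y}}\int\prod_{a=1}^{n}\differential\bm{\Delta}^{a}\int\differential Q\,\differential\chi\ e^{-\beta M\lambda\sum_{a=1}^{n}\norm{\bm{\Delta}^{a}}_{1}}\ \mathcal{I}\mathcal{L}, \tag{10}$$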

where $\mathcal{I}$ corresponds to the contribution from the RS constraint: i.e.

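$$\mathcal{I}:=\prod_{a=1}^{n}\delta\quantity(Q-\sum_{i\notin S}(\Delta_{i}^{a})^{2})\prod_{a<b}\delta\quantity(Q-\frac{\chi}{\beta}-\sum_{i\notin S}\Delta_{i}^{a}\Delta_{i}^{b}), \tag{11}$$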

and $\mathcal{L}$ is the contribution from the second line of (8), albeit simplified as a result of replica symmetry:

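$$\mathcal{L}:=\int D\bm{z}\quantity(\int\differential\bm{x}_{S}\ e^{-M\beta G(\bm{x}_{S};\bm{z})})^{n},\qquad G(\bm{x}_{S};\bm{z}):=\frac{\norm{\mathbf{A}_{S}\bm{x}_{S}+\sqrt{Q}\bm{z}-\bm{y}}^{2}}{2M(1+\chi)}+\lambda\norm{\bm{x}_{S}}_{1}. \tag{12}$$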

By using the Fourier representation of the delta function, (11) can be further rewritten as

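$$\mathcal{I}=\int_{-i\infty}^{+i\infty}\differential\hat{Q}\,\differential\hat{\chi}\ e^{\frac{Mn\beta}{2}\quantity(Q\hat{Q}+(n-1)\chi\hat{\chi}-n\beta Q\hat{\chi})}\times\int D\bm{z}\ e^{\sum_{a=1}^{n}\quantity(-\frac{M\beta\hat{Q}}{2}\norm{\bm{\Delta}^{a}}^{2}+\beta\sqrt{M\hat{\chi}}\,\bm{z}^{\mathsf{T}}\bm{\Delta}^{a})+o(\beta)}. \tag{13}$$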

Using this expression, the integral with respect to $\{\bm{\Delta}^{a}\}_{1\leq a\leq n}$ in (10) can be calculated analytically. Performing the saddle point approximation for large $M$ to the integrals with respect to $(Q,\hat{Q},\chi,\hat{\chi})$ , and finally taking the limit $\beta\to\infty$ after $n\to+0$ in (7) yields the following expression for $\mathcal{F}$ .

Claim 1.

The free energy is given by

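$$\begin{gathered}\mathcal{F}=\mathbb{E}_{\mathbf{A}_{S},\bm{y}}\ \operatorname*{Extr}_{\Theta}\Bigg\{-\frac{Q\hat{Q}-\chi\hat{\chi}}{2}\\ -\frac{\tilde{N}}{2\hat{Q}}\quantity[(\Lambda+\hat{\chi}){\rm erfc}\quantity(\sqrt{\frac{\Lambda}{2\hat{\chi}}})-\sqrt{\frac{2\Lambda\hat{\chi}}{\pi}}e^{-\Lambda/2\hat{\chi}}]\\ +\int D\bm{z}\min_{\bm{x}_{S}}G(\bm{x}_{S};\bm{z})\Bigg\}.\end{gathered} \tag{14}$$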

Here, $\Lambda:=(M\lambda)^{2}$ , ${\rm erfc}$ is the complementary error function ${\rm erfc}(x):=2/\sqrt{\pi}\int_{x}^{\infty}\differential te^{-t^{2}}$ , and $\operatorname*{{\rm Extr}}$ refers to the extremum condition with respect to $\Theta:=(Q,\hat{Q},\chi,\hat{\chi})$ , which are random variables dependent on $(\mathbf{A}_{S},\bm{y})$ .

Straightforward calculation shows that the extremum conditions are given by

[TABLE]

where the second equality in the condition for $\hat{Q}$ follows from Theorem 1 in Tibshirani and Taylor (2012). Note that the dependence of $\Theta$ on $(\mathbf{A}_{S},\bm{y})$ is not explicitly written for the sake of simplicity. This evaluation of $\mathcal{F}$ reduces the high-dimensional integral over $\mathbf{A}$ and $\bm{y}$ to an average over a four-dimensional extremum problem involving an $M$-dimensional integral with respect to $\bm{z}$, which can be numerically computed via iterative substitution and Monte Carlo sampling over $(\mathbf{A}_{S},\bm{y})$ and $\bm{z}$.

It is interesting to compare our replica analysis in the large $N$ and $M$ limit to the ones considering linear sparsity (Kabashima et al., 2009; Vehkaperä et al., 2016). In linear sparsity, the Lasso estimator’s statistical properties can effectively be described by a population of $N$ decoupled, independent scalar estimators under Gaussian noise with identical intensity as $N\to\infty$. This is often referred to as the decoupling principle in information theory; see Guo and Verdú (2005) and Bayati and Montanari (2011) for details. In the ultra-sparse case, the elements of the Lasso estimator in the active set, consisting of $d=O(1)$ terms, cannot be expected to decouple, as finite-size effects of non-Gaussian and correlated nature are expected to be significant in describing its profile. This is why a $d$-body optimization procedure and the average with respect to $(\mathbf{A}_{S},\bm{y})$ appear explicitly in (14). On the other hand, the decoupling principle is implicitly employed for the $\tilde{N}$ non-active variables conditioned on $(\mathbf{A}_{S},\bm{y})$. More explicitly, for each configuration of $(\mathbf{A}_{S},\bm{y})$, each element of the non-active Lasso estimator is statistically equivalent to

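$$(\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}))_{i\notin S}\sim g_{\lambda}(\hat{Q},\sqrt{\hat{\chi}}z_{i})=\operatorname*{arg\,min}_{x}\quantity(\frac{\hat{Q}}{2}x^{2}-\sqrt{\hat{\chi}}z_{i}x+M\lambda\absolutevalue{x}), \tag{19}$$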

where $z_{i}$ are i.i.d. according to $\mathcal{N}(0,1)$. Note that the decoupling principle, rigorously proven under AMP theory, does not necessarily need $N$ and $M$ to diverge at the same rate (Rush and Venkataramanan, 2018).
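The decoupled scalar problem has a closed-form solution. The sketch below assumes the effective problem for an inactive coordinate is $\min_{x}\ \frac{\hat{Q}}{2}x^{2}-\sqrt{\hat{\chi}}\,z\,x+M\lambda\absolutevalue{x}$ with $z\sim\mathcal{N}(0,1)$, the reading that is consistent with the ${\rm erfc}$ argument $\sqrt{\Lambda/2\hat{\chi}}$ in Claim 1. Its minimizer is a soft-thresholded Gaussian, and the probability of a nonzero value is ${\rm erfc}(\sqrt{\Lambda/2\hat{\chi}})$ with $\Lambda=(M\lambda)^{2}$. The numerical values of $\hat{Q}$ and $\hat{\chi}$ used here are arbitrary placeholders, not solutions of the extremum conditions.

```python
import math
import numpy as np

def g_scalar(q_hat, chi_hat, z, M, lam):
    """Minimizer of (q_hat/2) x^2 - sqrt(chi_hat) z x + M lam |x| (soft-thresholding)."""
    u = np.sqrt(chi_hat) * np.asarray(z, dtype=float)
    return np.sign(u) * np.maximum(np.abs(u) - M * lam, 0.0) / q_hat

def nonzero_probability(chi_hat, M, lam):
    """P(g != 0) for z ~ N(0, 1): erfc(sqrt(Lambda / (2 chi_hat))), Lambda = (M lam)^2."""
    return math.erfc(M * lam / math.sqrt(2.0 * chi_hat))

# Placeholder parameters: threshold M*lam = 3 and sqrt(chi_hat) = 3.
q_hat, chi_hat, M, lam = 2.0, 9.0, 10, 0.3
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
empirical = float(np.mean(g_scalar(q_hat, chi_hat, z, M, lam) != 0))
print(empirical, nonzero_probability(chi_hat, M, lam))
```

Multiplying this probability by $\tilde{N}$ gives the per-configuration contribution of the inactive set to the expected number of false positives.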

2.2 Performance assessment of Lasso

The free energy allows convenient evaluation of averages of certain functions of the estimator. More explicitly, for a function $\Psi:\mathbb{R}^{N}\to\mathbb{R}$ , its average with respect to the Boltzmann distribution (5) and $(\mathbf{A},\bm{y})$ is given by

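$$\left\langle\Psi(\bm{x})\right\rangle:=\lim_{\beta\to\infty}\mathbb{E}_{\mathbf{A},\bm{y}}\int\differential\bm{x}\ P_{\beta}(\bm{x}\mid\mathbf{A},\bm{y})\Psi(\bm{x})=-\lim_{\beta\to\infty}\lim_{h\to 0}\partialderivative{h}\beta^{-1}\mathbb{E}_{\mathbf{A},\bm{y}}\log Z_{\beta}(\mathbf{A},\bm{y};h\Psi), \tag{20}$$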

where

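$$Z_{\beta}(\mathbf{A},\bm{y};h\Psi):=\int\differential\bm{x}\ e^{-\frac{\beta}{2}\norm{\mathbf{A}\bm{x}-\bm{y}}^{2}-\beta M\lambda\norm{\bm{x}}_{1}-\beta h\Psi(\bm{x})}. \tag{21}$$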

For a class of functions $\Psi$ , the above can be calculated trivially, which we state in the following claim:

Claim 2 (Average with respect to active and inactive sets).

For arbitrary functions $\psi:\mathbb{R}\to\mathbb{R}$ and $\Psi:\mathbb{R}^{d}\to\mathbb{R}$ , we have

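$$\left\langle\sum_{i\notin S}\psi(x_{i})\right\rangle=\tilde{N}\,\mathbb{E}_{\mathbf{A}_{S},\bm{y}}\int Dz\ \psi\quantity(g_{\lambda}(\hat{Q},\sqrt{\hat{\chi}}z)), \tag{22}$$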

and

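$$\left\langle\Psi(\bm{x}_{S})\right\rangle=\mathbb{E}_{\rm eff}\Psi\quantity(\hat{\bm{x}}_{(1+\chi)\lambda}(\mathbf{A}_{S},\sqrt{Q}\bm{z}+\bm{y})), \tag{23}$$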

where $\mathbb{E}_{\rm eff}:=\mathbb{E}_{\mathbf{A}_{S},\bm{y}}\int D\bm{z}$, and $(Q,\hat{Q},\hat{\chi},\chi)$ is given by the solution of the extremum conditions (15)–(18) for each $(\mathbf{A}_{S},\bm{y})$. In particular, performance measures such as the averages of true positives (${\mathrm{TP}}$), false positives (${\mathrm{FP}}$) and the $\ell_{2}$ error $\epsilon_{x}:=\norm{\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y})-\bm{x}^{0}}^{2}$ are given by

[TABLE]

2.3 Necessary condition for support recovery

A particular topic of interest is partial support recovery, and the minimum number of samples $M$ necessary for the false positives to vanish in the limit $N\to\infty$. Although the fixed point equations (15)–(18) do not admit a closed form solution, a necessary condition in terms of the sample complexity can be derived under the following mild conditions:

Assumption 1.

A: (Uniqueness of fixed point) The solutions of the fixed point equations (15)–(18) are unique and satisfy $(Q,\hat{Q},\chi,\hat{\chi})\in(0,\infty)^{4}$.

B: (Concentration of the oracle Lasso estimator) The random variable

$$s_{\lambda}^{(M)}:=\frac{1}{M}\norm{\bm{\gamma}_{\lambda}(\bm{y})-\bm{y}}^{2}$$

has finite mean $\bar{s}_{\lambda}^{(M)}$ and variance converging to zero.

C: (Bounded variance of noise distribution) The distribution $p_{\xi}$ satisfies

$$\Gamma^{(M)}:=\frac{1}{M}\int\differential\bm{\xi}\ p_{\xi}(\bm{\xi})\norm{\bm{\xi}}^{2}<C$$

for some constant $C$.

Claim 3 (Necessary sample complexity for asymptotically zero false positives).

Let $M$ diverge with $N$ with scaling $M=\alpha\log N\ (\alpha>0)$. Under Claim 1 and Assumption 1, if there exists a constant $c>0$ such that $\left\langle{{\mathrm{FP}}}\right\rangle<O(N^{-c})$ in the limit $N\to\infty$, then

$$\alpha(1+\epsilon)>\alpha_{C}=\frac{\bar{s}_{\lambda}}{2\lambda^{2}}, \tag{27}$$

holds for any constant $\epsilon>0$ , where $\displaystyle\bar{s}_{\lambda}=\lim_{M\to\infty}\bar{s}_{\lambda}^{(M)}.$

The proof is postponed to Section 4. From this claim, the necessary sample complexity for partial support recovery follows immediately:

Claim 4 (Necessary sample complexity for partial support recovery).

Under the settings in Claim 3, if ${\rm supp}(\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}))\subseteq{\rm supp}(\bm{x}^{0})$ w.a.h.p., then $\alpha>\alpha_{C}$ .

By definition, $\bar{s}_{\lambda}$ is the prediction error of the oracle, which is given the sensing submatrix $\mathbf{A}_{S}$ and observation vector $\bm{y}$ . This is reminiscent of the primal-dual witness construction in Wainwright, 2009b , where sufficient conditions for asymptotically zero FPs are derived by solving the oracle Lasso first, and observing whether the oracle solution concatenated with $N-d$ zero elements is a unique solution of the original Lasso problem (2).
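The oracle-based check described above can be sketched in a few lines: solve the Lasso restricted to the true support, then test whether every inactive column satisfies the strict optimality condition $\absolutevalue{\bm{a}_{j}^{\mathsf{T}}(\bm{y}-\mathbf{A}_{S}\hat{\bm{x}}_{S})}<M\lambda$, in which case the zero-padded oracle solution solves the full problem (2) and has no false positives. This is an illustrative sketch, not the paper's procedure; `lasso_cd` and `oracle_certifies_no_false_positives` are hypothetical helper names.

```python
import numpy as np

def lasso_cd(A, y, lam, n_sweeps=500, tol=1e-10):
    """Coordinate descent for min_x 0.5*||A x - y||^2 + M*lam*||x||_1 (M = rows of A)."""
    M, N = A.shape
    x = np.zeros(N)
    r = np.array(y, dtype=float)
    col_sq = np.sum(A * A, axis=0)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(N):
            old = x[j]
            rho = A[:, j] @ r + col_sq[j] * old
            new = np.sign(rho) * max(abs(rho) - M * lam, 0.0) / col_sq[j]
            r -= A[:, j] * (new - old)
            x[j] = new
            max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return x

def oracle_certifies_no_false_positives(A, y, support, lam):
    """True if the zero-padded oracle Lasso solution is strictly dual feasible."""
    M, N = A.shape
    S = np.array(sorted(support))
    x_S = lasso_cd(A[:, S], y, lam)              # oracle Lasso on the true support
    residual = y - A[:, S] @ x_S
    inactive = np.setdiff1d(np.arange(N), S)
    return bool(np.all(np.abs(A[:, inactive].T @ residual) < M * lam))

rng = np.random.default_rng(1)
M, N = 80, 200
A = rng.standard_normal((M, N))
y = A[:, :3] @ np.ones(3)                        # noiseless, x0_S = (1, 1, 1)
print(oracle_certifies_no_false_positives(A, y, {0, 1, 2}, lam=0.5))
```

Since the inactive columns are independent of the oracle residual, each correlation $\bm{a}_{j}^{\mathsf{T}}(\bm{y}-\mathbf{A}_{S}\hat{\bm{x}}_{S})$ is Gaussian with variance equal to the squared residual norm, which is how the oracle prediction error enters the false positive rate.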

Furthermore, the necessary condition for perfect support recovery can also be derived using Claim 3.

Claim 5 (Necessary sample complexity for perfect support recovery).

Under the settings in Claim 3, suppose ${\rm supp}(\hat{\bm{x}}_{\lambda}(\mathbf{A},\bm{y}))={\rm supp}(\bm{x}^{0})$ holds w.a.h.p. Then

$$\alpha(1+\epsilon)>2\quantity(d+\frac{\Gamma}{\lambda^{2}}), \tag{28}$$

holds for any constant $\epsilon>0$ , where $\displaystyle\Gamma=\lim_{M\to\infty}\Gamma^{(M)}$ .

Note that in the special case of Gaussian noise with variance $\sigma^{2}$ , we have $\Gamma=\sigma^{2}$ , which extends the result of Wainwright, 2009b , Theorem 4 to the case $d=O(1)$ . Moreover, our result can be applied to any noise distribution satisfying Assumption 1.C.
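As a worked instance of the bound $2(d+\Gamma/\lambda^{2})$, the snippet below evaluates it for $d=3$ and $\lambda=0.5$ with Gaussian noise of variance $\sigma^{2}=0$ and $\sigma^{2}=0.5$ (so $\Gamma=\sigma^{2}$), the settings used in the numerical experiments. The function names and the choice $N=10^{4}$ in the usage line are illustrative assumptions.

```python
import math

def perfect_recovery_alpha_bound(d, gamma, lam):
    """Right-hand side of the Claim 5 bound: 2 * (d + Gamma / lambda^2)."""
    return 2.0 * (d + gamma / lam ** 2)

def samples_at_scaling(alpha, N):
    """Number of observations under the scaling M = alpha * log N."""
    return alpha * math.log(N)

for sigma2 in (0.0, 0.5):
    alpha = perfect_recovery_alpha_bound(d=3, gamma=sigma2, lam=0.5)
    print(sigma2, alpha, samples_at_scaling(alpha, N=10_000))
```

For these two settings the bound evaluates to 6 and 10 respectively.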

3 Numerical experiments

3.1 Non-asymptotic results

To verify the derived results based on Claim 1, numerical experiments were conducted. For simplicity, we consider the case where the active set has size $d=3$ with $\bm{x}^{0}_{S}=\bm{1}_{3}$, and $\bm{\xi}$ is generated from a Gaussian distribution with variance $\sigma^{2}$. Here, the value of $d$ is taken to be small enough that finite-size effects are non-negligible. The values of $\left\langle{{\mathrm{TP}}}\right\rangle,\left\langle{{\mathrm{FP}}}\right\rangle$ and $\epsilon_{x}$ obtained from our replica predictions (24)–(26) are compared with the average over $10^{4}$ experimental runs. The average with respect to $(\mathbf{A}_{S},\bm{y})$ for obtaining the replica prediction was approximated using a Monte Carlo procedure over $10^{6}$ samples.

Figure 1 shows that all three values from theory and experiment are in good agreement for parameters $(\lambda,\sigma^{2})=(0.5,0.0)$ and $(0.5,0.5)$ .
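The experimental side of this comparison can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes the Lasso objective $\frac{1}{2}\norm{\bm{y}-\mathbf{A}\bm{x}}^{2}+M\lambda\norm{\bm{x}}_{1}$ (inferred from the factor $M\lambda$ in the KKT conditions of Section 4), i.i.d. standard Gaussian sensing entries, and a plain coordinate descent solver. The names `soft_threshold`, `lasso_cd` and `run_trial`, as well as the sizes `N=30`, `M=40`, are hypothetical choices made only for this sketch.

```python
import random

def soft_threshold(u, t):
    """Soft-thresholding operator, the proximal map of t*|x|."""
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0

def lasso_cd(cols, y, reg, sweeps=100):
    """Coordinate descent for 0.5*||y - A x||^2 + reg*||x||_1.

    The matrix A is passed as a list of its columns (assumed layout)."""
    x = [0.0] * len(cols)
    r = list(y)  # running residual y - A x
    sq = [sum(c * c for c in col) for col in cols]
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # correlation of column j with the residual that excludes x[j]
            rho = sum(c * e for c, e in zip(col, r)) + sq[j] * x[j]
            new = soft_threshold(rho, reg) / sq[j]
            step = new - x[j]
            if step != 0.0:
                r = [e - c * step for c, e in zip(col, r)]
                x[j] = new
    return x

def run_trial(N, M, lam, sigma2, rng, d=3):
    """One synthetic run with x0 = (1,...,1,0,...,0); returns (TP, FP)."""
    cols = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]
    noise = [rng.gauss(0.0, 1.0) * sigma2 ** 0.5 for _ in range(M)]
    y = [sum(cols[j][m] for j in range(d)) + noise[m] for m in range(M)]
    x_hat = lasso_cd(cols, y, reg=M * lam)  # assumed M*lambda scaling
    tp = sum(1 for j in range(d) if x_hat[j] != 0.0)
    fp = sum(1 for j in range(d, N) if x_hat[j] != 0.0)
    return tp, fp

rng = random.Random(0)
runs = [run_trial(N=30, M=40, lam=0.5, sigma2=0.5, rng=rng) for _ in range(10)]
print("mean TP:", sum(t for t, _ in runs) / len(runs))
print("mean FP:", sum(f for _, f in runs) / len(runs))
```

Averaging `run_trial` over many more runs (the paper uses $10^{4}$) gives empirical counterparts of $\left\langle{{\mathrm{TP}}}\right\rangle$ and $\left\langle{{\mathrm{FP}}}\right\rangle$; the replica predictions themselves are not reproduced here.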

3.2 Asymptotic results

Claims 3 and 4 are also verified via numerical experiments; see Supplementary Materials for numerical experiments on Claim 5. In order to access the critical point $\alpha_{C}$ in (27), Monte Carlo experiments were conducted to evaluate $s_{\lambda}^{(M)}$ for different values of $M$ . Figure 2 shows the value of $s_{\lambda}^{(M)}$ at $(\lambda,\sigma^{2})=(0.5,0.0)$ and $(0.5,0.5)$ for both $\bm{x}_{S}^{0}=\bm{1}_{3}$ and $\bm{x}_{S}^{0}=[\frac{1}{3},\frac{2}{3},1]$ . From its asymptotic behavior, $\alpha_{C}$ can be evaluated as the values given in Table 1. Interestingly, for the case $\bm{x}_{S}^{0}=\bm{1}_{3}$ , the evaluated $\alpha_{C}$ is $6$ and $10$ for $\sigma^{2}=0$ and $0.5$ respectively, which coincides with the value $2(d+\Gamma/\lambda^{2})$ given in Claim 5.
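The agreement with Table 1 can be checked by direct arithmetic. The short sketch below evaluates $2(d+\Gamma/\lambda^{2})$ with $d=3$ and $\Gamma=\sigma^{2}$ (the Gaussian-noise case stated after Claim 5); the function name `claim5_threshold` is a hypothetical label used only here.

```python
def claim5_threshold(d, gamma, lam):
    """Evaluate 2*(d + Gamma/lambda^2), the quantity quoted from Claim 5."""
    return 2.0 * (d + gamma / lam ** 2)

# (lambda, sigma^2) pairs used for x0_S = [1, 1, 1] in Table 1
for lam, sigma2 in [(0.5, 0.0), (0.5, 0.5)]:
    print(lam, sigma2, claim5_threshold(d=3, gamma=sigma2, lam=lam))
```

For $\sigma^{2}=0$ this gives $2\cdot 3=6$, and for $\sigma^{2}=0.5$ it gives $2(3+0.5/0.25)=10$, matching the two entries of Table 1.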

Figure 3 shows the average number of FP and partial support recovery probability over 10,000 experimental runs for $\alpha$ in the vicinity of the numerically evaluated $\alpha_{C}$ for different values of $N$ . We observe that for $\alpha<\alpha_{C}$ , the average FP is consistently nondecreasing with respect to $N$ , while partial support recovery probability is consistently nonincreasing with respect to $N$ .

4 Proofs

4.1 Proof of Claim 3

The following lemmas will be useful in the proof.

Lemma 1 (Lemma 1, Dossal et al., (2012)).

There is a finite increasing sequence $(\lambda_{t})_{t\leq K}$ with $\lambda_{0}=0$ such that for all $t<K$ , the sign and support of $\hat{\bm{x}}_{\lambda}(\mathbf{A}_{S},\bm{y})$ are constant on each interval $(\lambda_{t},\lambda_{t+1})$ .

Lemma 2 (Lemma 1, Tibshirani and Taylor, (2012)).

The Lasso fit is 1-Lipschitz continuous with respect to $\ell_{2}$ norm.

Lemma 3 (Theorem II.13, Davidson and Szarek, (2001)).

Let $\mathbf{A}\in\mathbb{R}^{M\times d}$ be a random matrix with i.i.d. standard Gaussian entries. The largest and smallest eigenvalues of $\mathbf{B}=\mathbf{A}^{\mathsf{T}}\mathbf{A}$ satisfy

[TABLE]

for $t>0$ and

[TABLE]

for $0<t<\sqrt{M}-\sqrt{d}$ .

We now prove Claim 3. Define

[TABLE]

Let us evaluate the difference between $s^{(M)}_{(1+\chi)\lambda,Q}$ and $s^{(M)}_{\lambda,0}=s^{(M)}_{\lambda}$ when $\left\langle{{\mathrm{FP}}}\right\rangle<O(N^{-c})$ . Using the Cauchy–Schwarz inequality and the symmetry $\bm{\gamma}_{{\lambda}}({\bm{y}})=-\bm{\gamma}_{{\lambda}}({-\bm{y}})$ ,

[TABLE]

The triangle inequality and Lemma 2 imply that

[TABLE]

and similarly,

[TABLE]

To derive a bound for the last term in (4.1) and (4.1), Lemma 1 is employed. Let the support and sign of $\hat{\bm{x}}_{\lambda}(\mathbf{A}_{S},\bm{y})$ be constant in intervals $(M\lambda_{t},M\lambda_{t+1})\ (t=0,\cdots,K-1)$ , where $\lambda=\lambda_{0}<\cdots<\lambda_{K}=(1+\chi)\lambda$ . Let the support set in interval $(\lambda_{t},\lambda_{t+1})$ be given by $I_{t}$ , and let $\bm{s}_{t}\in\quantity{-1,0,1}^{\absolutevalue{I_{t}}}$ be the sign vector of $\hat{\bm{x}}_{\lambda^{\prime}}(\mathbf{A}_{S},\bm{y})\ (\lambda^{\prime}\in(\lambda_{t},\lambda_{t+1}))$ restricted to $I_{t}$ . From the KKT conditions, the Lasso fit is expressed as

[TABLE]

where $\mathbf{M}^{+}$ denotes the pseudoinverse of matrix $\mathbf{M}$ . We deduce

[TABLE]

Lemma 3, together with the inclusion principle $\rho((\mathbf{A}_{SI_{t}}^{\mathsf{T}}\mathbf{A}_{SI_{t}})^{-1})\leq\rho((\mathbf{A}_{S}^{\mathsf{T}}\mathbf{A}_{S})^{-1})$ , implies that w.a.h.p., $\norm{\bm{\gamma}_{{(1+\chi)\lambda}}({\bm{y}})-\bm{\gamma}_{{\lambda}}({\bm{y}})}\leq 2\chi\sqrt{d\lambda^{2}M}$ . The relations (31) – (4.1) and the inequality $\int D\bm{z}\norm{\bm{z}}=\sqrt{2}\Gamma((M+1)/2)/\Gamma(M/2)<\sqrt{M}$ then lead to the following holding w.a.h.p.

[TABLE]
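The inequality for the mean norm of an $M$-dimensional standard Gaussian vector invoked in this step, $\sqrt{2}\,\Gamma((M+1)/2)/\Gamma(M/2)<\sqrt{M}$ , can be verified numerically. The sketch below is only a sanity check for a range of $M$, not a proof; the helper name `mean_gaussian_norm` is hypothetical, and the log-gamma function is used to avoid overflow for large $M$.

```python
import math

def mean_gaussian_norm(M):
    """sqrt(2) * Gamma((M+1)/2) / Gamma(M/2), computed via log-gamma."""
    log_ratio = math.lgamma((M + 1) / 2) - math.lgamma(M / 2)
    return math.sqrt(2.0) * math.exp(log_ratio)

for M in (1, 2, 10, 100, 1000):
    # second column: exact mean of ||z||; third column: the bound sqrt(M)
    print(M, mean_gaussian_norm(M), math.sqrt(M))
```

The gap between the two columns shrinks with $M$ but stays positive, as expected from Jensen's inequality applied to $\norm{\bm{z}}=\sqrt{\norm{\bm{z}}^{2}}$ .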

We now use the following lemma which shows that $Q$ and $\chi$ are negligible almost surely.

Lemma 4.

Under the assumptions of Claim 3, $\chi<N^{-c/2}$ and $Q<N^{-c/4}$ hold w.a.h.p.

The proof is given in Supplementary Materials. Since $\norm{\bm{y}}<\norm{\mathbf{A}_{S}\bm{x}^{0}}+\norm{\bm{\xi}}$ is bounded by $M^{2}$ w.p.a.1 from Lemma 3 and Assumption 1.C, the right hand side of eq. (4.1) is of $O(N^{-c/8})$ w.p.a.1.

We therefore have

[TABLE]

On the other hand, the extremum conditions (16) and (2.1) imply that $\hat{\chi}$ is always bounded.

Lemma 5.

Suppose the extremum conditions (15)–(18) are satisfied. Then, the variable $\hat{\chi}$ satisfies

[TABLE]

Combined with (35), for sufficiently large $M$

[TABLE]

holds for arbitrary constant $\epsilon>0$ . This implies that $\frac{1}{2}\alpha\lambda^{2}(1+\epsilon)$ must be larger than the median of $s_{\lambda}^{(M)}$ . Now, the difference between the median and average is no larger than one standard deviation, which is negligible from Assumption 1.B. This yields the statement of the claim in the limit $M\to\infty$ .
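The last step relies on the general fact that the mean and median of any distribution with finite variance differ by at most one standard deviation. A small numerical illustration follows; the exponential distribution is an arbitrary skewed example chosen for this sketch, not the distribution of $s_{\lambda}^{(M)}$ , and `mean_median_gap` is a hypothetical helper name.

```python
import random
import statistics

def mean_median_gap(samples):
    """Return (|mean - median|, standard deviation) of an empirical sample."""
    gap = abs(statistics.mean(samples) - statistics.median(samples))
    return gap, statistics.pstdev(samples)

rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(10001)]
gap, std = mean_median_gap(samples)
print("gap:", gap, "std:", std)
```

Because the inequality holds for every distribution, it also holds exactly for the empirical distribution of the sample, so the printed gap never exceeds the printed standard deviation.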

4.2 Proof of Claim 4

From Theorem 6 in Osborne et al., (2000), the number of false positives is bounded by $\min(M,N)$ . Hence, we have $\left\langle{{\mathrm{FP}}}\right\rangle<M\times{\Pr}\quantity[{\mathrm{FP}}\neq 0]=O(N^{-c})$ for some $c>0$ . The statement of Claim 4 then follows from Claim 3.

4.3 Proof of Claim 5

From Claims 3 and 4, it suffices to show that

[TABLE]

The KKT conditions imply that w.p.a.1, $\mathbf{A}_{S}\hat{\bm{x}}=\mathbf{A}_{S}\mathbf{A}_{S}^{+}(\bm{y}-M\lambda(\mathbf{A}_{S}^{+})^{\mathsf{T}}\bm{s}),$ where we abbreviated $\hat{\bm{x}}:=\hat{\bm{x}}_{\lambda}(\mathbf{A}_{S},\bm{y}),$ and $\bm{s}={\rm sgn}\quantity(\hat{\bm{x}})$ . Therefore, $\bm{y}-\mathbf{A}_{S}\hat{\bm{x}}$ can be decomposed into a sum of two linearly independent vectors

[TABLE]

where $\bm{v}:=M\lambda\mathbf{A}_{S}(\mathbf{A}_{S}^{\mathsf{T}}\mathbf{A}_{S})^{-1}\bm{s},$ $\bm{v}_{\perp}:=\mathcal{P}_{\ker(\mathbf{A}_{S})}(\bm{y})=\mathcal{P}_{\ker(\mathbf{A}_{S})}(\bm{\xi})$ , and $\mathcal{P}_{\ker(\mathbf{A}_{S})}$ is the projection onto the kernel of $\mathbf{A}_{S}$ . The average of the squared norm of $\bm{v}$ can be evaluated as

[TABLE]

where the last inequality follows from Jensen’s inequality and $\mathbb{E}_{\rm eff}\lambda_{\min}(\bm{A}_{S}^{\mathsf{T}}\bm{A}_{S})\geq(\sqrt{M}-\sqrt{d})^{2}$ (Davidson and Szarek, 2001).
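The eigenvalue bound quoted from Davidson and Szarek (2001) can be illustrated by a small Monte Carlo experiment. The sketch below makes two simplifying assumptions that are not in the text: it takes $d=2$ , so that the smallest eigenvalue of the $2\times 2$ matrix $\mathbf{A}^{\mathsf{T}}\mathbf{A}$ has a closed form, and it averages under the plain i.i.d. standard Gaussian measure rather than $\mathbb{E}_{\rm eff}$ . The function names are hypothetical.

```python
import math
import random

def lambda_min_two_columns(M, rng):
    """Smallest eigenvalue of A^T A for an M x 2 standard Gaussian matrix A."""
    a = b = c = 0.0  # entries of the Gram matrix [[a, c], [c, b]]
    for _ in range(M):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        a += u * u
        b += v * v
        c += u * v
    # closed-form smaller root of the 2 x 2 characteristic polynomial
    return 0.5 * (a + b) - math.sqrt((0.5 * (a - b)) ** 2 + c * c)

def mean_lambda_min(M, trials, seed=0):
    """Monte Carlo average of the smallest eigenvalue over `trials` draws."""
    rng = random.Random(seed)
    return sum(lambda_min_two_columns(M, rng) for _ in range(trials)) / trials

M, d = 50, 2
bound = (math.sqrt(M) - math.sqrt(d)) ** 2
estimate = mean_lambda_min(M, trials=2000)
print("Monte Carlo mean:", estimate, "lower bound:", bound)
```

For these sizes the bound is $(\sqrt{50}-\sqrt{2})^{2}=32$ , and the Monte Carlo average sits well above it.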

To obtain a lower bound on the squared norm of $\bm{v}_{\perp}$ , fix the vector $\bm{\xi}$ . Noticing that the entries of $\mathbf{A}_{S}^{\mathsf{T}}\bm{\xi}/\norm{\bm{\xi}}$ are i.i.d. standard Gaussian, the tail bound for $\chi^{2}$ random variables (Laurent and Massart, 2000) implies that for some constant $C>0$ ,

[TABLE]

Using this inequality, (30) with $t=\sqrt{2\log M}$ and the union bound, we have that

[TABLE]

Equation (38) immediately follows from (40) and (4.3), which completes the proof.

5 Conclusion

In this paper, we provided an analysis based on an enhanced replica method for assessing the average performance of the Lasso estimator under ultra-sparse conditions. In addition, we derived necessary conditions for support recovery from the oracle Lasso estimator. Numerical experiments strongly support the validity of our analysis.

The methodological novelty originates from an observation of finite-size effects and correlations within the active set, which are implicitly assumed to be negligible in the conventional replica analysis. We anticipate that this framework is applicable to the analysis of other machine learning or optimization problems where finite-size effects are nonnegligible. Extending this method to more general sensing matrix ensembles is another exciting direction for future work.

Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Nos. 22J21581 (KO), 21K21310 (TT), 17H00764, 19H01812, 22H05117 (YK) and JST CREST Grant Number JPMJCR1912 (YK).

References

Abbara et al., (2020)

Abbara, A., Kabashima, Y., Obuchi, T., and Xu, Y. (2020).

Learning performance in inverse Ising problems with sparse teacher couplings.

Journal of Statistical Mechanics: Theory and Experiment, 2020(7):073402.

Bayati and Montanari, (2011)

Bayati, M. and Montanari, A. (2011).

The dynamics of message passing on dense graphs, with applications to compressed sensing.

IEEE Transactions on Information Theory, 57(2):764–785.

Bayati and Montanari, (2012)

Bayati, M. and Montanari, A. (2012).

The lasso risk for gaussian matrices.

IEEE Transactions on Information Theory, 58(4):1997–2017.

Bhadra et al., (2017)

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2017).

The Horseshoe+ Estimator of Ultra-Sparse Signals.

Bayesian Analysis, 12(4):1105 – 1131.

Candès and Plan, (2009)

Candès, E. J. and Plan, Y. (2009).

Near-ideal model selection by $\ell_{1}$ minimization.

The Annals of Statistics, 37(5A):2145 – 2177.

Chang et al., (2011)

Chang, S.-H., Cosman, P. C., and Milstein, L. B. (2011).

Chernoff-type bounds for the gaussian error function.

IEEE Transactions on Communications, 59(11):2939–2944.

Davidson and Szarek, (2001)

Davidson, K. R. and Szarek, S. J. (2001).

Chapter 8 - local operator theory, random matrices and banach spaces.

In Handbook of the Geometry of Banach Spaces, volume 1, pages 317–366. Elsevier Science B.V.

Donoho, (2006)

Donoho, D. L. (2006).

Compressed sensing.

IEEE Transactions on Information Theory, 52(4):1289–1306.

Donoho et al., (1992)

Donoho, D. L., Johnson, I. M., Hoch, J. C., and Stern, A. S. (1992).

Maximum Entropy and the Nearly Black Object.

Journal of the Royal Statistical Society. Series B (Methodological), 54(1):41 – 81.

Donoho et al., (2009)

Donoho, D. L., Maleki, A., and Montanari, A. (2009).

Message-passing algorithms for compressed sensing.

Proceedings of the National Academy of Sciences, 106(45):18914–18919.

Dossal et al., (2012)

Dossal, C., Chabanol, M.-L., Peyré, G., and Fadili, J. (2012).

Sharp support recovery from noisy random measurements by $\ell_{1}$-minimization.

Applied and Computational Harmonic Analysis, 33(1):24–43.

Fletcher et al., (2009)

Fletcher, A. K., Rangan, S., and Goyal, V. K. (2009).

Necessary and sufficient conditions for sparsity pattern recovery.

IEEE Transactions on Information Theory, 55(12):5758–5772.

Ghiringhelli et al., (2015)

Ghiringhelli, L. M., Vybiral, J., Levchenko, S. V., Draxl, C., and Scheffler, M. (2015).

Big data of materials science: Critical role of the descriptor.

Physical Review Letters, 114:105503.

Guo and Verdú, (2005)

Guo, D. and Verdú, S. (2005).

Randomly spread CDMA: asymptotics via statistical physics.

IEEE Transactions on Information Theory, 51(6):1983–2010.

Kabashima, (2003)

Kabashima, Y. (2003).

A CDMA multiuser detection algorithm on the basis of belief propagation.

Journal of Physics A: Mathematical and General, 36(43):11111–11121.

Kabashima et al., (2009)

Kabashima, Y., Wadayama, T., and Tanaka, T. (2009).

A typical reconstruction limit for compressed sensing based on $\ell_{p}$ -norm minimization.

Journal of Statistical Mechanics: Theory and Experiment, 2009(09):L09003.

Kim et al., (2016)

Kim, C., Pilania, G., and Ramprasad, R. (2016).

From organized high-throughput data to phenomenological theory using machine learning: The example of dielectric breakdown.

Chemistry of Materials, 28(5):1304–1311.

Laurent and Massart, (2000)

Laurent, B. and Massart, P. (2000).

Adaptive estimation of a quadratic functional by model selection.

The Annals of Statistics, 28(5):1302 – 1338.

Meinshausen and Bühlmann, (2006)

Meinshausen, N. and Bühlmann, P. (2006).

High-dimensional graphs and variable selection with the Lasso.

The Annals of Statistics, 34(3):1436 – 1462.

Meng et al., (2021a)

Meng, X., Obuchi, T., and Kabashima, Y. (2021a).

Ising model selection using $\ell_{1}$ -regularized linear regression: A statistical mechanics analysis.

In Advances in Neural Information Processing Systems, volume 34, pages 6290–6303.

Meng et al., (2021b)

Meng, X., Obuchi, T., and Kabashima, Y. (2021b).

Structure learning in inverse Ising problems using $\ell_{2}$ -regularized linear estimator.

Journal of Statistical Mechanics: Theory and Experiment, 2021(5):053403.

Mézard and Montanari, (2009)

Mézard, M. and Montanari, A. (2009).

Information, Physics, and Computation.

Oxford University Press, Inc., USA.

Mézard et al., (1986)

Mézard, M., Parisi, G., and Virasoro, M. (1986).

Spin Glass Theory and Beyond.

World Scientific.

Osborne et al., (2000)

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000).

On the lasso and its dual.

Journal of Computational and Graphical Statistics, 9(2):319–337.

Pilania et al., (2016)

Pilania, G., Mannodi-Kanakkithodi, A., Uberuaga, B. P., Ramprasad, R., Gubernatis, J. E., and Lookman, T. (2016).

Machine learning bandgaps of double perovskites.

Scientific Reports, 6(1):19375.

Rush and Venkataramanan, (2018)

Rush, C. and Venkataramanan, R. (2018).

Finite sample analysis of approximate message passing algorithms.

IEEE Transactions on Information Theory, 64(11):7264–7286.

Stojnic, (2013)

Stojnic, M. (2013).

A framework to characterize performance of lasso algorithms.

arXiv, https://arxiv.org/abs/1303.7291.

Su et al., (2017)

Su, W., Bogdan, M., and Candès, E. (2017).

False discoveries occur early on the lasso path.

The Annals of Statistics, 45(5):2133–2150.

Takahashi and Kabashima, (2022)

Takahashi, T. and Kabashima, Y. (2022).

Macroscopic analysis of vector approximate message passing in a model-mismatched setting.

IEEE Transactions on Information Theory, 68(8):5579–5600.

Thrampoulidis et al., (2018)

Thrampoulidis, C., Abbasi, E., and Hassibi, B. (2018).

Precise error analysis of regularized $m$ -estimators in high dimensions.

IEEE Transactions on Information Theory, 64(8):5592–5628.

Tibshirani, (1996)

Tibshirani, R. (1996).

Regression shrinkage and selection via the lasso.

Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.

Tibshirani and Taylor, (2012)

Tibshirani, R. J. and Taylor, J. (2012).

Degrees of freedom in lasso problems.

The Annals of Statistics, 40(2):1198 – 1232.

Vehkaperä et al., (2016)

Vehkaperä, M., Kabashima, Y., and Chatterjee, S. (2016).

Analysis of regularized ls reconstruction and random matrix ensembles in compressed sensing.

IEEE Transactions on Information Theory, 62(4):2100–2124.

Wainwright, (2009a)

Wainwright, M. J. (2009a).

Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting.

IEEE Transactions on Information Theory, 55(12):5728–5741.

Wainwright, (2009b)

Wainwright, M. J. (2009b).

Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$ -constrained quadratic programming (lasso).

IEEE Transactions on Information Theory, 55(5):2183–2202.

Wang et al., (2020)

Wang, H., Yang, Y., Bu, Z., and Su, W. (2020).

The complete lasso tradeoff diagram.

In Advances in Neural Information Processing Systems, volume 33, pages 20051–20060. Curran Associates, Inc.

Zdeborová and Krzakala, (2016)

Zdeborová, L. and Krzakala, F. (2016).

Statistical physics of inference: thresholds and algorithms.

Advances in Physics, 65(5):453–552.

Zhang and Huang, (2008)

Zhang, C.-H. and Huang, J. (2008).

The sparsity and bias of the Lasso selection in high-dimensional linear regression.

The Annals of Statistics, 36(4):1567 – 1594.

Zhao and Yu, (2006)

Zhao, P. and Yu, B. (2006).

On model selection consistency of lasso.

Journal of Machine Learning Research, 7(90):2541–2563.

Supplementary Materials

Appendix A Detailed derivation of Claim 1

Here, we derive the expression in Claim 1; see Figure 4 for an outline of the calculation. For simplicity, we abbreviate $\mathbb{E}_{\mathbf{A}_{\backslash S}}$ , the average over $\mathbf{A}$ excluding the submatrix acting on $S$ , as $\mathbb{E}$ , and denote by $\mathbf{A}_{\backslash S}$ the submatrix of $\mathbf{A}$ excluding $\mathbf{A}_{S}$ . Using the shorthand expressions $\differential\bm{x}^{a}_{S}:=\prod_{i\in S}\differential x_{i}^{a}$ and $\differential\bm{\Delta}^{a}:=\prod_{i\notin S}\differential\Delta_{i}^{a}$ , $\mathbb{E}Z^{n}_{\beta}(\mathbf{A},\bm{y})$ can be written as

[TABLE]

Using the Fourier representation, the average of the delta functions over $\mathbf{A}_{\backslash S}=(\tilde{\bm{a}}_{1},\cdots,\tilde{\bm{a}}_{M})^{\mathsf{T}}$ is given by

[TABLE]

where we defined the matrix $\mathbf{Q}$ as $\quantity(\mathbf{Q})_{ab}:=(\bm{\Delta}^{a})^{\mathsf{T}}\bm{\Delta}^{b}$ and used the notation $\bm{h}_{\mu}:=(h_{\mu}^{1},\cdots,h_{\mu}^{n})\in\mathbb{R}^{n}$ without confusion. This implies that the vector $\bm{h}_{\mu}$ is Gaussian with covariance matrix $\mathbf{Q}$ . Now, the replica symmetric ansatz (9) implies that the integral over $\quantity{\bm{\Delta}^{a}}_{a=1}^{n}$ is dominated by the subspace of the form

[TABLE]

which allows us to simplify the profile of $h_{\mu}^{a}$ as

[TABLE]

where $z_{\mu}$ and $v_{\mu}^{a}\ (a=1,\cdots,n)$ are all i.i.d. standard Gaussian variables. Using (43)–(45) yields the expression

[TABLE]

where $\mathcal{L}$ is given by

[TABLE]

The integral with respect to $\bm{x}_{S}$ can be evaluated using Laplace’s method for large $\beta$ , yielding

[TABLE]

where the subleading terms are ignored. Similarly, $\mathcal{I}$ is given by

[TABLE]

Therefore, ignoring the subleading term with respect to $\beta$ ,

[TABLE]

The log of the integral with respect to $D\hat{\bm{z}}$ can be expanded as

[TABLE]

where Laplace’s approximation was used for large $\beta$ to obtain the third line. Substituting (47), (A) and (49) into (46), using the saddle point method for large $M$ results in

[TABLE]

Noticing that

[TABLE]

and finally rescaling $\hat{Q}\leftarrow M\hat{Q}$ and $\hat{\chi}\leftarrow M\hat{\chi}$ , one obtains (14).

Appendix B Proof of auxiliary lemmas

B.1 Proof of Lemma 4

From (16) and (2.1), we have $\chi=f\quantity(\frac{\tilde{N}}{M-\bar{d}}{\rm erfc}\quantity(\sqrt{\frac{\Lambda}{2\hat{\chi}}}))$ , where $f(x)=x/(1-x)$ and $\bar{d}:=\int D\bm{z}\norm{\hat{\bm{x}}_{(1+\chi)\lambda}(\sqrt{Q}\bm{z}+\bm{y})}_{0}$ . From $\chi>0$ , we see that $f$ is an increasing function. By using the Markov inequality, it can be deduced that

[TABLE]

which proves the first part of the lemma with $c_{1}=c/2$ . For the probability bound on $Q$ , using ${\rm erfc}(x)<\frac{1}{x\sqrt{\pi}}e^{-x^{2}}$ ,

[TABLE]

where

[TABLE]

Using ${\rm erfc}^{-1}(Mx/N)>(1-x)^{2}$ for $M/N<1$ , we have for large enough $M$ ,

[TABLE]

Since both $g$ and $g^{-1}$ are nonnegative and increasing, for large enough $M$ ,

[TABLE]

The Markov inequality then implies the second part of the lemma with $c_{1}=c/2$ :

[TABLE]
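The elementary tail bound ${\rm erfc}(x)<\frac{1}{x\sqrt{\pi}}e^{-x^{2}}$ for $x>0$ used in the proof above can be checked against the standard library implementation of ${\rm erfc}$ . This is a numerical sanity check on a grid of points only; `erfc_upper` is a hypothetical helper name.

```python
import math

def erfc_upper(x):
    """Upper bound exp(-x^2) / (x * sqrt(pi)) on erfc(x), valid for x > 0."""
    return math.exp(-x * x) / (x * math.sqrt(math.pi))

for x in (0.1, 0.5, 1.0, 2.0, 4.0):
    # second column: erfc(x); third column: the upper bound
    print(x, math.erfc(x), erfc_upper(x))
```

The bound is loose for small $x$ and becomes tight as $x$ grows, which is the regime relevant when the argument of ${\rm erfc}$ diverges.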

B.2 Proof of Lemma 5

Equations (16) and (2.1) imply that $N{\rm erfc}\quantity(\frac{\Lambda}{\sqrt{2\hat{\chi}}})=\hat{Q}\chi\leq M\frac{\chi}{1+\chi}\leq M$ holds for any $(\mathbf{A}_{S},\bm{y})$ . Thus, $\hat{\chi}$ is deterministically upper-bounded as

[TABLE]

Now, ${\rm erfc}$ satisfies (Chang et al., 2011) for $0<\epsilon<1/3$ ,

[TABLE]

Applying this inequality to (52) with $\epsilon=M^{-1}$ and $N=\exp(M/\alpha)$ for $M>3$ yields

[TABLE]

Appendix C Additional numerical experiments: Necessary condition for perfect support recovery

To verify the necessary sample complexity for perfect recovery given by Claim 5, numerical experiments were conducted. The profile of $\bm{x}^{0}$ is the same as that of Section 3.1, and the regularization parameter is taken as $\lambda=0.5$ . Figure 5 shows the perfect support recovery probability for noise distributed according to the Gaussian, uniform, and Laplace distribution. Clearly, for all three cases, perfect recovery fails with finite probability as $N$ tends to infinity when $\alpha$ is less than the value indicated by Claim 5.
