This paper analyzes the high SNR consistency of compressive sensing algorithms in underdetermined linear models, providing necessary and sufficient conditions, and proposing SNR-adaptive tuning parameters that improve performance at high SNR.
Contribution
It derives the first comprehensive necessary and sufficient conditions for high SNR consistency of popular CS algorithms and introduces SNR-adaptive tuning parameters.
Findings
01
Necessary conditions show high SNR inconsistency with existing tuning parameters.
02
Proposed SNR-adaptive tuning parameters achieve high SNR consistency.
03
Numerical results demonstrate improved performance over existing methods at high SNR.
Abstract
High signal to noise ratio (SNR) consistency of model selection criteria in linear regression models has attracted a lot of attention recently. However, most of the existing literature on high SNR consistency deals with model order selection. Further, the limited literature available on the high SNR consistency of subset selection procedures (SSPs) is applicable to linear regression with full rank measurement matrices only. Hence, the performance of SSPs used in underdetermined linear models (a.k.a compressive sensing (CS) algorithms) at high SNR is largely unknown. This paper fills this gap by deriving necessary and sufficient conditions for the high SNR consistency of popular CS algorithms like l0-minimization, basis pursuit de-noising or LASSO, orthogonal matching pursuit and Dantzig selector. Necessary conditions analytically establish the high SNR inconsistency of CS algorithms…
\begin{array}[]{ll}\mathbb{P}(\mathcal{E}_{2}^{C}|\mathcal{E}_{1})\leq&\sum\limits_{i=1}^{k^{*}-1}\mathbb{P}(\overset{{S_{i}}}{\{\text{SC is satisfied for}\ i\}}|\mathcal{E}_{1})+\\
&\mathbb{P}(\overset{S_{k^{*}}}{\{\text{SC is not satisfied for }k^{*}}\}|\mathcal{E}_{1}).\end{array}
\begin{array}[]{ll}\mathbb{P}(\mathcal{E}_{2}^{C}|\mathcal{E}_{1})\leq&\sum\limits_{i=1}^{k^{*}-1}\mathbb{P}(\overset{{S_{i}}}{\{\text{SC is satisfied for}\ i\}}|\mathcal{E}_{1})+\\
&\mathbb{P}(\overset{S_{k^{*}}}{\{\text{SC is not satisfied for }k^{*}}\}|\mathcal{E}_{1}).\end{array}
P(Si)≤P(∥w∥2+σΓ4>λi),∀i<k∗.
P(Si)≤P(∥w∥2+σΓ4>λi),∀i<k∗.
P(Si)≤P(∥w∥2+σΓ5>λi),∀i<k∗.
P(Si)≤P(∥w∥2+σΓ5>λi),∀i<k∗.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
High signal to noise ratio (SNR) consistency of model selection criteria in linear regression models has attracted a lot of attention recently. However, most of the existing literature on high SNR consistency deals with model order selection. Further, the limited literature available on the high SNR consistency of subset selection procedures (SSPs) is applicable to linear regression with full rank measurement matrices only. Hence, the performance of SSPs used in underdetermined linear models (a.k.a compressive sensing (CS) algorithms) at high SNR is largely unknown. This paper fills this gap by deriving necessary and sufficient conditions for the high SNR consistency of popular CS algorithms like l0-minimization, basis pursuit de-noising or LASSO, orthogonal matching pursuit and Dantzig selector. Necessary conditions analytically establish the high SNR inconsistency of CS algorithms when used with the tuning parameters discussed in literature. Novel tuning parameters with SNR adaptations are developed using the sufficient conditions and the choice of SNR adaptations are discussed analytically using convergence rate analysis. CS algorithms with the proposed tuning parameters are numerically shown to be high SNR consistent and outperform existing tuning parameters in the moderate to high SNR regime.
Subset selection or variable selection in linear regression models is the identification of the support of regression vector β, i.e., I=supp(β)={j:βj=0} in the regression model y=Xβ+w. Here, X∈Rn×p is a known design matrix with unit l2 norm columns, y∈Rn is the observed vector and w∼N(0n,σ2In) is the additive white Gaussian noise with known variance σ2. Let
k∗ denotes the number of non-zero entries in β. In this paper, we consider subset selection in underdetermined linear models, i.e., X with more columns than rows (n≤p). This problem studied under the compressive sensing (CS) paradigm is of fundamental importance in statistical signal processing, machine learning etc. Many compressive sensing (CS) algorithms with varying performance complexity trade-offs and optimality conditions are available[1, 2, 3, 4, 5, 6] for this purpose. The performance of these CS algorithms are evaluated either in terms of mean square error (MSE) between β and the estimate β^ returned by the CS algorithm[7] or the correctness with which the estimated support I^=supp(β^) matches the true support I [8]. In this paper, we evaluate CS algorithms in terms of the probability of support recovery error defined by PE=P(I^=I).
Traditionally, PE is evaluated in the large sample regime, i.e., n→∞ or (n,p)→∞[9]. In their landmark paper [10], Ding and Kay demonstrated that subset selection procedures
(SSPs) in overdetermined linear models that are large sample consistent (i.e., PE→0asn→∞) often performs poorly in a finite n and high signal to noise ratio (SNR) (i.e., small σ2) regime. This result generated great interest in the signal processing community on the behaviour of SSPs as σ2→0. Formally, a SSP is said to be high SNR consistent if its’ PE→0 as σ2→0. In this paper, we discuss the high SNR consistency of popular CS algorithms that are used for subset selection in underdetermined linear models. After presenting the mathematical notations, we elaborate on the existing literature on high SNR consistency and CS algorithms.
I-A Notations used in this paper.
col(X) the column space of X. XT is the transpose and X†=(XTX)−1XT is the Moore-Penrose pseudo inverse of X (if X has full column rank). PX=XX† is the projection matrix onto col(X). In represents an n×n identity matrix and 0n represents an n×1 zero vector. XJ denotes the sub-matrix of X formed using the columns indexed by J. Xi,j is the [i,j]th entry of X. If X is clear from the context, we use the shorthand PJ for PXJ. aJ or a(J) denotes the entries of a indexed by J. N(u,C) is a Gaussian vector with mean u and covariance C. χj2 is a central chi square distribution with j degrees of freedom (d.o.f) and χj2(λ) is a non central chi square distribution with j d.o.f and non-centrality λ. a∼b implies that a and b are identically distributed. ∣()∣ denotes the absolute value for scalar arguments and cardinality for set arguments. ∥a∥q=(j∑∣aj∣q)q1 for 1≤q<∞ is the lq norm, ∥a∥∞=jmax∣aj∣ is the l∞ norm and ∥a∥0=∣supp(a)∣ is the l0 quasi norm of a respectively. a is called k∗-sparse iff ∥a∥0=k∗. ∥A∥m,l=∥x∥m=1max∥Ax∥l is the (m,l)th matrix norm. [p] denotes the set {1,…,p}. For any two index sets J1 and J2, the set difference J1/J2={j∈J1:j∈/J2}. f(n)=o(g(n)) iff n→∞limg(n)f(n)=0.
I-B Prior literature on high SNR consistency
Most of the existing literature related to high SNR consistency including the seminal work by Ding and Kay [10] are related to the model order selection (MOS) problem. MOS is a subset selection problem where I is restricted to the form I=[k∗]. Another interesting problem related to MOS is the estimation of smallest k~ such that β satisfies βj=0,∀j>k~ and βj can be zero or non-zero for j<k~. In both these cases, the statistician is required to the estimate the model order k∗ or k~. A number of MOS criteria like exponentially embedded family (EEF)[11], normalised maximum likelihood based minimum description length (NMDL)[12], g-prior based MDL (g-MDL)[13], forms of Bayesian Information criteria (BIC)[14, 15], sequentially normalised least squares (SNLS)[16] etc. are proved to high SNR consistent[17, 10, 18, 19]. All these MOS criteria can be formulated as the minimization of a penalised log likelihood
[TABLE]
over the collection of subsets {Jk}k=1p, where Jk=[k] and h(k,σ2) is a penalty function. Necessary and sufficient conditions (NSCs) for a MOS criterion to be high SNR consistent is derived in [17]. Applying MOS criteria to the general subset selection problem where I can be any subset of [p] involves the minimization of PLL(J) over the entire 2p subsets J⊆[p]. This approach though theoretically optimal is computationally intractable. Consequently a number of suboptimal but low complexity SSPs are developed. To the best of our knowledge, only two SSPs, both of which are based on the least squares (LS) estimate of β (i.e., β^LS=X†y) are known to be high SNR consistent[17, 8].
I-C Contributions of this paper
The existing literature on high SNR consistency in linear regression is applicable only to regression models with full column rank design matrices. Hence, existing literature is not applicable to underdetermined linear models, i.e., X with n<p. Identifying the true support I in an underdetermined linear model is an ill-posed problem unless certain structures are imposed on the regression vector β and design matrix X. Throughout this paper, we assume that the regression vector β is sparse, i.e., k∗=∣I∣≪p and k∗<n. The structure imposed on X depends on the particular CS algorithm used.
This paper makes the following contributions to CS literature from the viewpoint of high SNR consistency. We first derive NSCs on the tuning parameter Γ0 such that the support estimate I^=supp(β^) delivered by
[TABLE]
is high SNR consistent. It should be noted that optimization problem in l0-penalty is NP-hard [20]. Hence, a number of suboptimal techniques broadly belonging to two classes, convex relaxation (CR) [4, 2] and greedy algorithms [3, 6] are developed in literature. We mainly consider two CR techniques in this paper, viz.,
[TABLE]
[TABLE]
l1-penalty and l1-error are also known as basis pursuit de-noising (BPDN) or least absolute shrinkage and selection operator (LASSO).
We derive NSCs on Γ1, Γ2 such that l1-penalty and l1-error are high SNR consistent. We also derive NSCs on the hyper parameter Γ3 of the popular CR technique Dantzig selector [2] given by
[TABLE]
for the special case of111In this article we consider a popular formulation of CS algorithms where the tuning parameters are explicitly scaled by σ or σ2. Quite often σ or σ2 is included in the tuning parameter itself. For example, l0-penalty may be written as β^=b∈Rpargmin∥y−Xb∥22+λ0∥b∥0. Using the relation λ0=σ2Γ0, the NSCs derived in terms of Γ0 can be easily restated in terms of λ0 also. orthonormal X.
Orthogonal matching pursuit (OMP) [3, 21, 22, 23, 24] is a popular greedy algorithm with sound performance guarantees and low computational complexity in comparison with CR based SSPs. OMP is characterized by its’ stopping condition (SC). We also derive high SNR consistent SCs for OMP.
Necessary conditions derived for l0-penalty, l1-penalty, l1-error and DS analytically establish the high SNR inconsistency of these schemes with the values of {Γk}k=03 discussed in literature. High SNR inconsistency of OMP with popular SCs is numerically established. These inconsistencies are due to the absence of SNR adaptations in the tuning parameters. The sufficient conditions delivers a range of SNR adaptations for tuning parameters that will result in high SNR consistency. To compare various SNR adaptations, we derived simple bounds on the convergence rates of l1-penalty. Extensive numerical simulations conducted on various subset selection scenarios demonstrate the potential of some of these SNR adaptations to significantly outperform existing tuning parameters in the moderate to high SNR regime. In addition to being a topic of theoretical importance, high SNR consistency of CS algorithms have tremendous practical value. A number of applications such as multi user detection[25], on-off random access[26], CS based single snapshot direction of arrival [27] etc. demands support recovery with very low values of PE in the moderate to high SNR regime. The high SNR consistent tuning parameters derived in this article can be applied directly for such applications in the moderate to high SNR regime.
I-D Organization of paper
Section @slowromancapii@ gives mathematical preliminaries. Section @slowromancapiii@ discuss the high SNR consistency of l0-penalty, Section @slowromancapiv@ discuss the consistency of CR techniques and Section @slowromancapv@ discuss the consistency of OMP. Section @slowromancapvi@ validates the analytical results through numerical simulations.
II Mathematical preliminaries
In this section, we present a brief overview of mathematical concepts from CS and probability theory used in this article.
II-A * Qualifiers for design matrix X.*
When n<p, the linear equation y=Xβ has infinitely many possible solutions. Hence the support recovery problem is ill-posed even in the noiseless case. To uniquely recover the k∗-sparse vector β, the measurement matrix X has to satisfy certain well known regularity conditions.
Definition 1: The spark of a matrix X(spark(X)) is the smallest number of columns in X that are linearly dependent.
Consider the following the optimization problem.
[TABLE]
In words β^ is the sparsest vector that solves the linear equation y=Xb. The following lemma relates the unique recovery of sparse vectors with spark(X) in the absence of noise.
Lemma 1**.**
To uniquely recover all k∗-sparse vectors β using (1), it is necessary and sufficient that spark(X)>2k∗ [1].
The optimization problem (2) cannot be solved in polynomial time. For polynomial complexity CS algorithms like DS, l1-penalty, l1-error, OMP etc. spark(X)>2k∗ is not sufficient to guarantee unique recovery even in the noiseless case. A plethora of sufficient conditions including restricted isometry property (RIP)[1, 21], mutual incoherence condition (MIC)[4, 23], exact recovery condition (ERC)[3, 4] etc. are discussed in the literature. The high SNR analysis of CR techniques and OMP in this article uses ERC and MIC which are defined next.
Definition 2:- A matrix X and a vector β with support I is said to be satisfying ERC if the exact recovery coefficient erc(X,I)=j∈/Imax∥XI†Xj∥1 satisfies erc(X,I)<1.
It is known that ERC is a sufficient and worst case necessary condition for accurately recovering I from y=Xβ using OMP and the basis pursuit (BP) algorithm that solves
[TABLE]
in the noiseless case[3, 4]. ERC is also used to study the performance of l1-penalty, l1-error and OMP in noisy data[4, 23]. Since the ERC assumption involves the unknown support I, it is impossible to check ERC in practice. Likewise, verifying the spark assumption is computationally intractable. Hence, the MIC, an assumption which can be easily verified is popular in CS literature[23].
Definition 3:- A k∗-sparse vector β satisfies MIC, iff the mutual coherence μX=i=jmax∣XiTXj∣ satisfies μX<2k∗−11.
If μX<2k∗−11, then ERC is satisfied for all k∗-sparse vector β, i.e., erc(X,I)<1[3]. Likewise, MIC guarantees that spark(X)>2k∗[3]. Since, MIC implies both ERC and spark assumption, the analysis conducted based on ERC and spark are automatically applicable to problems satisfying MIC.
Remark 1*.*
The number of measurements n is an important factor in deciding the properties of X like spark, μX etc. In this paper, we will not explicitly quantify n, however by stating conditions on spark(X), μX, ERC etc. we implicitly assume that n is sufficiently large enough to satisfy these conditions.
II-B * Standard Convergence concepts [Chapter 4,[28]].*
A collection of random variables (R.Vs) Xσ2 converges in probability (C.I.P) to a R.V Y, i.e., Xσ2→PY as σ2→0 iff ∀ϵ>0, σ2→0limP(∣Xσ2−Y∣>ϵ)=0. A R.V X is B.I.P iff it is finite almost everywhere, i.e., for any ϵ>0, ∃Rϵ<∞ such that P(∣X∣>Rϵ)<ϵ. For an event A, σ2→0limP(A)=0 iff for each ϵ>0, ∃σ∗2(ϵ)>0 such that P(A)≤ϵ, ∀σ2<σ∗2(ϵ). Next we describe the relationship between projection matrices and χ2 R.Vs[17].
Lemma 2**.**
Let P be an arbitrary n×n projection matrix with rank j. Then for any z∼N(u,σ2In), σ2∥Pz∥22∼χj2(σ2∥Pu∥22) and σ2∥(In−P)z∥22∼χn−j2(σ2∥(In−P)u∥22). Consider the two full rank sub matrices XJ1 and XJ2 formed by columns of X indexed by J1⊂J2. Let PJ1 and PJ2 represent the projection matrices onto the column space of XJ1 and XJ2 respectively. Then for any R.V z∼N(u,σ2In), σ2∥(PJ1−PJ2)z∥22∼χ∣J2∣−∣J1∣2(σ2∥(PJ1−PJ2)u∥22).
Next we state a frequently used convergence result [17].
Lemma 3**.**
Let z∼χj2(σ2λ), where λ>0 is a constant w.r.t σ2. Then σ2z→Pλ as σ2→0.
II-C High SNR consistency: Definition
The high SNR consistency results available in literature[10, 17] deals with full rank linear regression models. Since, uniqueness issues are absent when rank(X)=p, this definition of high SNR consistency demands that PE→0 as σ2→0 for every signal β∈Rp. In this article, we relax this definition to account for the uniqueness issues present in regression models with n<p using the concept of regression class. A regression class C is defined as the collection of matrix signal pairs (X,β) where perfect recovery is possible for a particular algorithm under noiseless conditions. For l0-penalty, C1={(X,β):spark(X)>2∣supp(β)∣} is a regression class. Similarly, C2={(X,β):μX≤2∣supp(β)∣−11} and C3={X,β:erc(X,supp(β))<1} forms regression classes for l1-penalty, l1-error and OMP. We now formally define high SNR consistency in underdetermined regression models.
Definition 4:- A SSP is said to be high SNR consistent for a regression class C if PE=P(I^=I) converges to zero as σ2→0 for every matrix vector pair (X,β)∈C.
In words, a SSP is high SNR consistent if it can deliver a PE arbitrarily close to zero by decreasing the noise variance σ2. Even though every signal in a regression class can be perfectly recovered under noiseless conditions (σ2=0), to achieve a near perfect recovery at high SNR (i.e., σ2=0, but close to zero), the tuning parameters for the SSPs need to be selected appropriately. In the following sections, we discuss the conditions on the tuning parameters such that the support can be recovered with arbitrary precision as σ2 decreases.
III High SNR consistency of l0-penalty based SSP.
In this section, we describe the high SNR behaviour of β^=b∈Rpargmin∣∣y−Xb∣∣22+Γ0σ2∣∣b∣∣0 and I^=supp(β^), where the tuning parameter Γ0 is a deterministic positive quantity. The values of Γ0 discussed in the literature includes the Akaike information criteria (AIC) with Γ0=2, minimum description length (MDL) or Bayesian information criteria (BIC) with Γ0=log(n), risk inflation criteria (RIC) of Foster and George (RIC-FG) with Γ0=2log(p)[29], RIC of Zhang and Shen (RIC-ZS) with Γ0=2log(p)+2log(log(p)) [30], extended Bayesian information criterion (EBIC) with Γ0=log(n)+∥b∥02γlog((∥b∥0p))[31] etc. The hyper parameter γ in EBIC is a user defined parameter. Under a set of regularity conditions on the matrix X and β, it was shown that l0-penalty is large sample consistent if, Γ0=o(nc2−c1), k∗log(p)=o(nc2−c1) and Γ0−2log(p)−log(log(p))→∞ as n→∞. Here, c1 and c2 are parameters depending on the regularity conditions[32]. This result hold true for (n,p,k∗)→∞ and n<p or n≪p. Note that these tuning parameters are derived based on the large sample behaviour of l0-penalty. The conditions for high SNR consistency of l0-penalty are not discussed in the literature to the best of our knowledge.
Next we state and prove the sufficient conditions for the high SNR consistency of l0-penalty.
Theorem 1**.**
Consider a matrix X which satisfies spark(X)>2k∗. Then for any k∗-sparse signal β, l0-penalty is high SNR consistent if σ2→0limΓ0=∞ and σ2→0limσ2Γ0=0.
Proof.
The optimization problem in l0-penalty can be stated more explicitly as I^=J⊂[p]argminL(J), where L(J)=b:supp(b)=Jmin∥y−Xb∥22+Γ0σ2∣J∣. When XJ has full rank, the solution to b:supp(b)=Jmin∥y−Xb∥22 = a∈R∣J∣min∥y−XJa∥22 is unique and equal to a^=(XJTXJ)−1XJTy. In this case, XJa^=PJy and b:supp(b)=Jmin∥y−Xb∥22 is equal to ∥(In−PJ)y∥22. Here PJ=XJ(XJTXJ)−1XJ is a projection matrix of rank ∣J∣=rank(XJ).
When XJ is rank deficient, the solution to b:supp(b)=Jmin∥y−Xb∥22 = a∈R∣J∣min∥y−XJa∥22
can be any one of the infinitely many vectors a^ that solves XJTXJa^=XJTy. A typical solution is denoted by
a^=(XJTXJ)−XJTy, where (XJTXJ)− is called the generalized inverse of XJTXJ[33]. The matrix XJ(XJTXJ)−XJ satisfies all the properties of a projection matrix of rank(XJ). We denotes this matrix by PJ itself with a caveat that rank(PJ)=rank(XJ)<∣J∣. With this convention, when XJ is rank deficient, b:supp(b)=Jmin∥y−Xb∥22=∥(In−PJ)y∥22. Hence, l0-penalty can be reformulated as
[TABLE]
Define the error event E={I^=I}={∃J∈[p]:L(J)≤L(I)}. Applying union bound to PE=P(E) gives
[TABLE]
where H1={J∈[p]:(In−PJ)Xβ=0n} and H2={J∈[p]:(In−PJ)Xβ=0n}. In words, H1 represent the subsets J⊆[p] such that the col(XJ) does not cover the signal subspace col(XI). For I={1,2}, assuming that the columns X1, X2 and X3 are linearly independent, the subsets J={1}, J={3}, J={1,3} etc. belongs to H1. Similarly, H2 represents the subsets J⊆[p] such that the col(XJ) cover the signal subspace col(XI). For I={1,2}, J={1,2,3}, J={1,2,3,4} etc. will belong to H2. We consider both these summations separately.
Case 1 (In−PJ)Xβ=0n:-
In this case, it can happen that ∣J∣>k∗, ∣J∣=k∗ or ∣J∣<k∗. Since I=supp(β), (In−PI)Xβ=0n. Thus, by Lemma 2, A1=σ2∥(In−PI)y∥22∼χn−k∗2. Likewise, (In−PJ)Xβ=0n implies that A2=σ2∥(In−PJ)y∥22∼χn−rank(XJ)2(σ2λJ), where λJ=∥(In−PJ)Xβ∥22>0. Hence,
[TABLE]
Since, A1∼χn−k∗2 is a B.I.P R.V, A1σ2→P0 as σ2→0. By Lemma 3, σ2A2→PλJ>0 as σ2→0. By the hypothesis of Theorem 1, Γ0σ2(∣J∣−k∗)→0 as σ2→0. This implies that (A2−A1)σ2+Γ0σ2(∣J∣−k∗)→PλJ>0. Now, by the definition of C.I.P, for any ϵ>0, ∃σJ2>0 such that P(∣(A2−A1)σ2+Γ0σ2(∣J∣−k∗)−λJ∣>2λJ)<ϵ, for all σ2<σJ2. This implies that
[TABLE]
∀σ2<σJ2. Thus, σ2→0limP(EJ)=0, ∀J∈H1. This together with ∣H1∣<∞ implies that σ2→0limP1=0.
Case 2 (In−PJ)Xβ=0n:-spark(X)>2k∗ implies that β is the sparsest solution to the equation Xb=Xβ. Hence, (In−PJ)Xβ=0n implies that ∣J∣>k∗. Since, (In−PJ)Xβ=0n, A2=σ2∥(In−PJ)y∥22∼χn−rank(XJ)2. Thus P(EJ) becomes
[TABLE]
Note that both A1 and A2 are B.I.P R.Vs with distribution independent of σ2 and so is A1−A2. Thus, ∃tϵ<∞ independent of σ2 such that P(A1−A2>tϵ)<ϵ. Since, ∣J∣>k∗, by the hypothesis of Theorem 1, Γ0(∣J∣−k∗)→∞ as σ2→0. Thus, ∃σJ2>0, such that Γ0(∣J∣−k∗)>tϵ, ∀σ2<σJ2. Combining, we get P(EJ)<ϵ,∀σ2<σJ2. Thus, σ2→0limP(EJ)=0, ∀J∈H2. This together with ∣H2∣<∞ implies that σ2→0limP2=0. Thus, under the hypothesis of Theorem 1, l0-penalty is high SNR consistent.
∎
Remark 2*.*
Theorem 1 details a range of SNR adaptations on Γ0 such that l0-penalty is high SNR consistent. However, different SNR adaptations satisfying Theorem 1 leads to different convergence rates of PE. The proof of Theorem 1 reveals that P1 is related to the probability of underestimation and P2 is related to the probability of overestimation in MOS problems detailed in [17]. To summarise, Γ0 with faster rate of increase to ∞ will have lower values of P2 and higher values of P1 and vice versa.
III-A * High SNR consistency of l0-penalty: Necessary conditions*
The SNR adaptations required by Theorem 1 are in sharp contrast to the σ2 independent values of Γ0 discussed in literature. The following theorem proves that l0-penalty with σ2 independent values of Γ0 are inconsistent at high SNR.
Theorem 2**.**
Consider a matrix X with spark(X)>2k∗. Then for any k∗-sparse vector β, l0-penalty is high SNR consistent only if σ2→0limΓ0=∞.
Proof.
Define J=I∪i, where i∈/I. Note that ∣J∣=k∗+1≤2k∗, ∀k∗≥1 and ∣J∣=1 if k∗=0. Further for any matrix X, spark(X)≥2. Hence, spark(X)>2k∗ implies that XJ has full rank for k∗≥0. This together with I⊂J implies that (In−PJ)Xβ=0n. Expanding L(J)=∥(In−PJ)y∥22+σ2Γ0∣J∣ and applying Lemma 2, we have
[TABLE]
where A=σ2yT(PJ−PI)y∼χ12, ∀σ2>0. A∼χ12 implies that A=Z2, where Z∼N(0,1). Thus, PE≥P(A>Γ0)=P(∣Z∣>Γ0)=2Q(Γ0),∀σ2>0. Here Q(x)=2π1∫t=x∞exp(−2t2)dt is the complementary cumulative distribution function of a N(0,1) R.V. Hence, l0-penalty is high SNR consistent only if σ2→0limΓ0=∞.
∎
Remark 3*.*
It follows directly from the proof of Theorem 2 that PE of l0-penalty with SNR independent Γ0 like BIC, AIC etc. satisfy PE≥2Q(Γ0), ∀σ2>0. For RIC-FG with Γ0=2log(p), the lower bound 2Q(Γ0) will be less than 0.01 only for p≥28 and 2Q(Γ0)≤0.001 only for p≥225. Hence, the performance of these criteria in small and medium sized problems will be suboptimal.
Theorems 2 implies that σ2→0limΓ0=∞ is a necessary condition for high SNR consistency. We next establish the necessity of σ2→0limσ2Γ0=0 for high SNR consistency.
Theorem 3**.**
Consider a matrix X which satisfies spark(X)>2k∗. Then for any k∗-sparse signal β with k∗≥1, l0-penalty is high SNR consistent only if σ2→0limσ2Γ0=0.
Proof.
Define J=I/i, where i∈I. Since, J⊂I and spark(X)>2k∗, it follows that (In−PJ)Xβ=0n. Expanding L(J)=∥(In−PJ)y∥22+σ2Γ0∣J∣ and applying Lemma 2, we have
[TABLE]
where A=yT(PI−PJ)y∼σ2χ12(σ2λ) with λ=∥(PI−PJ)Xβ∥22>0. By Lemma 3, A→Pλ as σ2→0. Suppose that σ2→0limσ2Γ0=λ1, where λ1>λ. Then, ∃σ12>0 such that 2λ1+λ<σ2Γ0<λ1, ∀σ2<σ12. This implies that
[TABLE]
Since, A→Pλ as σ2→0, for any ϵ>0, ∃σ22>0 such that P(∣A−λ∣>2λ1−λ)≤ϵ,∀σ2<σ22. Fix σ2(ϵ)=min(σ12,σ22). Then ∀σ2<σ2(ϵ),PE≥1−ϵ. Thus if λ1>λ, then σ2→0limPE=1. This implies that σ2→0limσ2Γ0<λ is a necessary condition for high SNR consistency. However, without a priori knowledge of non-zero entries of β, λ is unknown.
Hence, l0-penalty is high SNR consistent only if σ2→0limσ2Γ0=0.
∎
Remark 4*.*
The formulation of l0-penalty given in (4) is exactly similar to that of MOS problems given in (1) except that the search space of MOS is a very small subset of the search space in l0-penalty. This is reflected in the similarity of NSCs for MOS derived in [17] and Theorems 1-3 for subset selection. It is also true that different values of Γ0 gives EEF, NMDL etc. as special cases. Hence, Theorems 1-3 can be seen as an extension of the existing high SNR consistency results in [17, 10, 19, 18] to subset selection problems. However, the novelty of Theorems 1-3 lies in the fact that it explicitly takes into account the identifiability issues associated with subset selection in underdetermined linear models. These structural issues were not considered in [17, 10, 19, 18] which dealt with MOS in overdetermined linear regression models.
IV High SNR consistency of convex relaxation based SSPs
In this section, we derive NSCs on the tuning parameters {Γi}i=13 such that l1-penalty, l1-error and DS are high SNR consistent. Unlike the NP-hard l0-penalty which is computationally infeasible except in small sized problems, the CR based SSPs discussed in this section and the greedy algorithms like OMP discussed in Section @slowromancapv@ can be implemented with polynomial complexity. Hence, these techniques are practically important. Unlike the high SNR consistency of l0-penalty whose connections with the high SNR consistency in MOS problems we previously mentioned, the high SNR consistency of CR and greedy algorithms are not discussed in open literature to the best of our knowledge. We first discuss the l1-penalty based SSP.
IV-A * High SNR consistency of l1-penalty: Sufficient conditions*
In this section, we discuss the high SNR behaviour of
β^=b∈Rpargmin21∥y−Xb∥22+Γ1σ∥b∥1
and I^=supp(β^). This is a widely used SSP in high dimensional statistics. l1-penalty is the convex program that is closest to the optimal but NP-hard l0-penalty. Commonly used values of Γ1 include Γ1=22log(p)[34], Γ1=8(1+η)log(p−k∗) [7], Γ1=10log(p) [35] etc. Here, η>0 is a constant. The large sample consistency of l1-penalty is also widely studied. For a fixed p and k∗, all values of Γ1 satisfying nΓ1→0 and n21+cΓ1→∞ as n→∞ results in large sample consistency under a set of regularity conditions [9]. c depends on these regularity conditions. However, the consistency of l1-penalty as σ2→0 is not discussed in literature to the best of our knowledge. Next we state and prove the sufficient conditions for the high SNR consistency of l1-penalty.
Theorem 4**.**
l1-penalty is high SNR consistent for any matrix signal pair (X,β) satisfying the ERC provided that the tuning parameter Γ1 satisfies σ2→0limΓ1=∞ and σ2→0limσΓ1=0.
Proof.
The proof of Theorem 4 is based on the following fundamental result proved in [Theorem 8,[4]].
Lemma 4**.**
*Let J be any index set satisfying ERC. If yJ=PJy satisfies ∥XT(y−yJ)∥∞<σΓ1(1−erc(X,J)), then β^ satisfies the following.
A1). supp(β^)⊆J.
A2). β^ is the unique minimizer of l1-penalty.
A3). T={j:∣bJ(j)∣>Γ1σ∥(XJTXJ)−1∥∞,∞}⊆supp(β^), where bJ=XJ†y is the LS estimate of βJ.*
In words, Lemma 4 states that if the correlation between the columns in X and residual generated by the LS fit using the columns in J is sufficiently low, then the support of solution to l1-penalty will be contained in J. Further, l1-penalty does not miss indices that has sufficiently large values in the restricted LS estimate bJ. By the hypothesis of Theorem 4, the true support I satisfies erc(X,I)<1. Thus, if the event E1={∥XT(y−yI)∥∞<Γ1σ(1−erc(X,I))} is true, then supp(β^)⊆I. That is, l1-penalty does not make any false discoveries. If the event E2={∀j:∣bI(j)∣>Γ1σ∥(XITXI)−1∥∞,∞}={∣T∣=k∗} is also true, then supp(β^)=I. Thus P(I^=I)≥P(E1∩E2).
We first analyse the probability of the event E1.
Note that (In−PI)Xβ=(In−PI)XIβI=0. Hence ∥XT(In−PI)y∥∞=∥XT(In−PI)w∥∞. Further, ∥Xj∥2=1 and Cauchy Schwartz inequality implies that jmax∣XjT(In−PI)w∣≤jmax∥Xj∣∣2∣∣(In−PI)w∥2=∥(In−PI)w∥2. Using these inequalities, we can bound P(E1) as
[TABLE]
Note that σ2∥(In−PI)w∥22∼χn−k∗2 is a B.I.P R.V with distribution independent of σ2.
Hence, if the condition σ2→0limΓ1=∞ in the hypotheses of Theorem 4 is satisfied, then the lower bound in (12) converges to 1. Hence, σ2→0limP(E1)=1.
Next, we analyse P(E2). Since I is the correct support, it follows that bI=XI†(XIβI+w)=βI+XI†w. Since w∼N(0n,σ2In), we have bI∼N(βI,σ2(XITXI)−1). The set T in A3) of Lemma 4 can be rewritten as T={j:∣bI(j)∣>σcjΓ1dj}, where cj=((XITXI)−1)j,j and dj=cj∥(XITXI)−1∥∞,∞.
The NSC for the high SNR consistency of a threshold based SSP like this is given below.
Lemma 5**.**
*Let z∼N(u,σ2C) and K=supp(u). Consider the threshold based estimator K^={j:∣zj∣>σCj,jΓ} of K. Define the event false discovery F={∃j∈K^andj∈/K} and missed discovery M={∃j∈/K^andj∈K}. Then the following statements are true[8].
Hence, if Γ1 satisfies σ2→0limσΓ1=0, then by L2) of Lemma 5, all entries in I will be included in T at high SNR. Mathematically, σ2→0limP(E2)=σ2→0limP(∣T∣=k∗)=1. Since, σ2→0limP(E1)=1 and σ2→0limP(E2)=1, it follows that σ2→0limP(I^=I)≥σ2→0limP(E1∩E2)=1.
∎
IV-B On the choice of SNR adaptation in Γ1.
Theorem 4 states that all SNR adaptations on Γ1 satisfying σ2→0limΓ1=∞ and σ2→0limσΓ1=0 results in the high SNR consistency of l1-penalty. However, the choice of SNR adaptation has profound influence on the performance of l1-penalty in the moderate to high SNR range. In this section, we derive convergence rates for P(E1) and P(E2) discussed in the proof of Theorem 4. First consider the event E1={∥XT(y−yI)∥∞<Γ1σ(1−erc(X,I))}. Following (12), we have
[TABLE]
where A∼χn−k∗2. Let X∼χk2 and a2>k. Then by Lemma 10 in [17], we have
[TABLE]
Let b1=1−erc(X,I) and b2=(n−k∗)2n−k∗exp(2n−k∗). Applying (14) in (13) gives,
[TABLE]
The R.H.S of inequality in (15) is independent of σ2 for the SNR independent Γ1 discussed in literature. Further, the inequality (15) converges to one faster as the growth of Γ1 increases. Let Γ1=σα1 be the SNR adaptation in Γ1. This adaptation satisfies Theorem 4 if 0<α<1. The convergence rate of P(E1) will be faster for α1 than that of α2 if α1>α2.
Next consider the event E2={∀j:∣bI(j)∣>Γ1σcjdj}, where bI=XI†y, cj=((XITXI)−1)j,j and dj=cj∥(XITXI)−1∥∞,∞ as used in the proof of Theorem 4. The following set of inequalities follows directly from union bound and the bI(j)∼N(βj,σ2cj2) distribution of bI(j).
[TABLE]
Applying Q(x)>0,∀x, gives
[TABLE]
For the ease of exposition assume that βj<0,∀j∈I.
Since, σ2→0limσΓ1=0, we have σ2→0lim(−Γ1dj−σcjβj)=∞. Hence, ∃σ12>0 such that −Γ1dj−σcjβj>2,∀j. Using the bound Q(x)≤21exp(−2x2),∀x>2, we have
[TABLE]
∀σ2<σ12. Unlike the bound (15) on P(E1), the R.H.S in (18) increases with the signal strength ∣βj∣. Further, the convergence rate of P(E2) decreases with the increase in the rate at which Γ1 increase to ∞. For Γ1=σα1, the convergence rate of P(E2) decreases with increase in α.
We now make the following observations on the choice of SNR adaptations based on (15) and (18). Consider SNR adaptations of the form Γ1=σα1. When signal strength is low, i.e., βj is low for some j∈I, it is reasonable to choose slow rates for Γ1 like α=0.1. This will ensure the increase of P(E1) to one at a descent rate without causing significant decrease in the convergence rates of P(E2). However, when the signal strength is high, i.e., βj is high for all j∈I, P(E2) will be close to one for moderate values of SNR for most values of 0<α<1. Then the gain in the convergence rate of P(E1) by allowing a larger value of α will overpower the slight decrease in the convergence rate in P(E2). Hence, when signal strength is high, one can choose faster SNR adaptations like α=0.5.
IV-C * High SNR consistency of l1-penalty: Necessary conditions*
In the following, we establish the necessity of SNR adaptations detailed in Theorem 4 for high SNR consistency.
Theorem 5**.**
Suppose ∃J⊃I such that the matrix support pair (X,J) satisfy ERC. Then l1-penalty is high SNR consistent only if σ2→0limΓ1=∞.
Proof.
Let J⊃I be an index set satisfying ERC. Define the events E1:{∥XT(y−yJ)∥∞<Γ1σ(1−erc(X,J))} and E2:{∣T∣=∣J∣}, where T={j:∣bJ(j)∣>cjσΓ1dj}, cj=((XJTXJ)−1)j,j and dj=cj∥(XJTXJ)−1∥∞,∞. yJ=PJy and bJ=XJ†y are the same as in Lemma 4. If both these events are true, then by Lemma 4, I^=supp(β^)=J⊃I. Hence, PE=P(I^=I)≥P(E1∩E2). Since I⊂J, y−yJ=(In−PJ)y=(In−PJ)w. Replacing I with J in (12), we have
[TABLE]
where A=σ2∥(In−PJ)w∥22∼χn−∣J∣2 is a R.V with distribution independent of σ2 and support in (0,∞). Hence, as long as σ2→0limΓ1>0, σ2→0limP(A<Γ12(1−erc(X,J))2)>0, which in turn imply that σ2→0limP(E1)>0.
We next consider P(E2). Since I⊂J, we have XIβI=XJβJ with appropriate zero entries in βJ. Thus, bJ∼N(βJ,σ2(XJTXJ)−1). Hence, ∣T∣=∣J∣ iff a false discovery is made in the thresholding procedure which gives T. From Lemma 5, it follows that σ2→0limP(E2)=σ2→0limP(∣T∣=∣J∣)>0 as long as σ2→0limΓ1<∞.
A careful analysis of the events E1 and E2 reveals that E1 depends only on the component (In−PJ)w of w and E2 depends only on the component PJw. Since, these two components are orthogonal and w is Gaussian, it follows that E1 and E2 are mutually independent, i.e., P(E1∩E2)=P(E1)P(E2). Since, σ2→0limP(E1)>0 and σ2→0limP(E2)>0, it follows that σ2→0limPE≥σ2→0limP(E1∩E1)=σ2→0limP(E1)σ2→0limP(E2)>0, unless σ2→0limΓ1=∞.
∎
It must be mentioned that an index set J⊃I satisfying ERC need not exist in all situations where I satisfy ERC. In that sense, Theorem 5 is less general than Theorem 4. Nevertheless, Theorem 5 is applicable in many practical settings. For example, if X∈Rn×p is orthonormal, then X satisfies ERC for all possible index sets J⊆[p]. Similarly, if X satisfies the MIC of order j, i.e., μX≤2j−11 and j>k∗, then X satisfies ERC for all j>k∗ sized index sets. In both these situations, an index set (in fact many) J⊃I satisfying ERC exists and l1-penalty will be inconsistent without the required SNR adaptation. Theorem 5 proves that the values of Γ1 discussed in literature makes l1-penalty high SNR inconsistent even in the simple case of orthonormal design matrix. Next we establish the necessity of σ2→0limσΓ1=0 for high SNR consistency.
Theorem 6**.**
Suppose that the matrix support pair (X,I) satisfy ERC and k∗≥1. Then, l1-penalty will be high SNR consistent only if σ2→0limσΓ1=0.
Proof.
Let J be any index set satisfying J⊂I. Since, I satisfy ERC, J will also satisfy ERC. Consider the event E:{∥XT(y−yJ)∥∞<Γ1σ(1−erc(X,J))}, where yJ=PJy. If E is true, then by Lemma 4, I^=supp(β^)⊆J⊂I. Thus, PE≥P(E). The following bound on P(E) follows from Cauchy Schwartz inequality and the unit l2 norm of Xj.
[TABLE]
where A=σ2∥(In−PJ)y∥22∼χn−∣J∣2(σ2λ) and λ=∥(In−PJ)Xβ∥22>0. By Lemma 3, σ2A→Pλ as σ2→0. Hence, if σ2→0limσ2Γ12(1−erc(X,J))2>λ, then as shown in the proof of Theorem 3, σ2→0limP(σ2A<σ2Γ12(1−erc(X,J))2)=1. However, λ is unknown. Thus, to satisfy σ2→0limσ2Γ12(1−erc(X,J))2<λ, it is necessary that σ2→0limσ2Γ12=0 which is equivalent to σ2→0limσΓ1=0.
∎
A widely used formulation of l1-penalty is given by β^=b∈Rpargmin21∥y−Xb∥22+λ∥b∥1 which is equivalent to the formulation in this article by setting λ=Γ1σ. In this formulation, l1-penalty is high SNR consistent if σ2→0limσλ=∞ and σ2→0limλ=0. An interesting case is that of a fixed σ independent λ like λ=0.1. This choice of λ satisfy σ2→0limσλ=∞ which is a necessary condition for high SNR consistency. However, the satisfiability of the necessary condition in Theorem 6 depends upon on the signal β (Please see the proof of Theorem 6). Hence, when a priori knowledge of β is not available, a fixed regularization parameter is not advisable from the vantage point of high SNR consistency.
IV-D High SNR consistency of l1-error
We next discuss the high SNR behaviour of β^=b∈Rpargmin∥b∥1,subject to∥y−Xb∥2≤Γ2σ and I^=supp(β^). l1-penalty is the Lagrangian of the constrained optimization problem given by l1-error. The performance of l1-error is dictated by the choice of Γ2. Commonly used choice of Γ2 include Γ2=n+22n[36], Γ2=n+2nlog(n)
[37] etc. A high SNR analysis of l1-error in terms of variable selection properties is not available in open literature to the best of our knowledge. The following theorem states the sufficient conditions for l1-error to be high SNR consistent.
Theorem 7**.**
l1-error is high SNR consistent for any matrix support pair (X,I) satisfying the ERC provided that the tuning parameter Γ2 satisfies σ2→0limΓ2=∞ and σ2→0limσΓ2=0.
Proof.
The proof of Theorem 7 is based on the result in [Theorem 14,[4]] regarding the minimizers of l1-error.
Lemma 6**.**
Let J be any index set satisfying ERC. If yJ=PJy satisfies
[TABLE]
*then β^ satisfies the following.
A1). supp(β^)⊆J.
A2). β^ is the unique minimizer of l1-error.
A3). T={j:∣bJ(j)∣>Γ2σ∥XJ†∥2,2}⊆supp(β^).
Here bJ=XJ†y is the LS estimate of βJ.*
In words, Lemma 6 states that if the residual between y and the LS fit of y using the columns in XJ has sufficiently low correlation with the columns in X and sufficiently low l2 norm, then the support of the solution to l1-error will be contained in J. Further, l1-error does not miss indices that have sufficiently large values in the restricted LS estimate bJ. By the hypothesis of Theorem 7, the true support I satisfies erc(X,I)<1. Thus, if the event E1={(\refequationlemmal1error) is satisfied}, then I^=supp(β^)⊆I. If the event E2={∀j:∣bI(j)∣>Γ2σ∥XI†∥2,2}={∣T∣=k∗} is also true, then supp(β^)=I. Thus P(I^=I)≥P(E1∩E2).
We first analyse the probability of the event E1. By Cauchy Schwartz inequality and the fact that ∥Xj∥2=1, we have ∣XjT(y−yI)∣≤∥y−yI∥2,∀j. Thus, ∥XT(y−yI)∥∞2≤∥y−yI∥22. Hence, Γ22σ2>∥y−yI∥22aI, where aI=(1+(1−erc(X,I))2∥XI†∥2,12) implies (21). Thus,
[TABLE]
Note that y−yI=(In−PI)y=(In−PI)w. Hence, A=σ2∥y−yI∥22∼χn−k∗2. Since, χn−k∗2 is a B.I.P R.V with σ2 independent distribution, it follows that σ2→0limP(A<Γ22aI−1)=1 if σ2→0limΓ2=∞. This implies that σ2→0limP(E1)=1.
Next we consider the event E2. The index set T in Lemma 6 can be rewritten as T={j:∣bI(j)∣>σcjΓ1dj}, where cj=(XITXI)j,j−1 and dj=cj∥XJ†∥2,2. The event {∣T∣=k∗} happens iff there is no missed discovery in the thresholding procedure generating T. Then it follows from σ2→0limσΓ1=0 and L2) of Lemma 5 that σ2→0limP(E2)=σ2→0limP(∣T∣=k∗)=1. Since, σ2→0limP(E1)=1 and σ2→0limP(E2)=1, it follows that σ2→0limP(I^=I)≥σ2→0limP(E1∩E2)=1.
∎
The following theorem states that the SNR adaptations outlined in Theorem 7 are necessary for high SNR consistency.
Theorem 8**.**
*The following statements regarding the high SNR consistency of l1-error are true.
1). Suppose ∃J⊃I such that the matrix support pair (X,J) satisfy ERC. Then l1-error is high SNR consistent only if σ2→0limΓ2=∞.
2). Suppose that the matrix support pair (X,I) satisfy ERC and k∗≥1. Then, l1-error will be high SNR consistent only if σ2→0limσΓ2=0.*
Note that the values of Γ2 discussed in literature do not satisfy the NSCs outlined in Theorems 7 and 8. Hence, l1-error with these values of Γ2 will be inconsistent at high SNR.
IV-E * Analysis of Dantzig selector based SSP*
Here, we discuss the high SNR behaviour of β^ given by
β^=b∈Rpargmin∥b∥1,subject to∥XT(y−Xb)∥∞≤Γ3σ.
and I^=supp(β^). The properties of DS is determined largely by the hyper parameter Γ3. Commonly used values include Γ3=2log(p) [2], Γ3=(23+2log(p)) [38] etc. No high SNR consistency results for DS is reported in open literature to the best of our knowledge. Next we state and prove the NSCs for the high SNR consistency of DS when X is orthonormal.
Theorem 9**.**
For an orthonormal design matrix X, DS is high SNR consistent iff σ2→0limΓ3=∞ and σ2→0limσΓ3<j∈Imin∣βj∣.
Proof.
When X is orthonormal, the solution to DS is given by β^j=(∣(XTy)j∣−Γ3σ)+sign((XTy)j),∀j. Note that (x)+=x if x>0 and (x)+=0 if x≤0. Thus I^ is obtained by thresholding the vector ∣XTy∣ at level σΓ3. Since, X is orthonormal, we have XTy∼N(β,σ2Ip). The proof now follows directly from Lemma 5.
∎
Note that no a priori knowledge of β is available. Hence, to achieve consistency, it is necessary that σ2→0limσΓ3=0. It follows directly from Theorem 9 that the values of Γ3 discussed in literature are inconsistent for orthonormal matrices. This implies the inconsistency of these tuning parameters in regression classes based on μX and ERC which includes orthonormal matrices too. We now make an observation regarding the NSCs developed for l0-penalty, l1-error, l1-penalty and DS.
Remark 5*.*
The SNR adaptations prescribed for high SNR consistency have many similarities. Even though l0-penalty requires Γ0σ2→0 whereas other algorithms requires Γiσ→0,
the effective regularization parameter, i.e., λ0=Γ0σ2 for l0-penalty and λi=Γiσ for other algorithms satisfies λi→0 as σ2→0. In the absence of noise (i.e σ2=0) equality constrained optimization problems (2) and (3) will correctly recover the support of β under spark and ERC assumptions respectively. Further, when the effective regularization parameter λi→0, l0-penalty automatically reduces to (2), whereas, l1-penalty, l1-error and DS reduces to (3). Hence, the condition λi→0 as σ2→0 is a natural choice to transition from the formulations for noisy data to the equality constrained l0 or l1 minimization ideal for noiseless data.
V Analysis of Orthogonal Matching Pursuit
OMP[21, 22, 23] is one of most popular techniques in the class of greedy algorithms to solve CS problems. Unlike the CR techniques like l1-penalty which has a computational complexity O(np2), OMP has a complexity of only O(npk∗). Consequently, OMP is more easily scalable to large scale problems than CR techniques. Further, the performance guarantees for OMP are only slightly weaker compared to CR techniques. An algorithmic description of OMP is given below.
Step 1:
Initialize the residual r0=y. Support estimate J0=ϕ. Iteration counter i=1;
2. Step 2:
Find the column index most correlated with the current residual ri−1, i.e., ti=t∈[p]argmax∣XtTri−1∣.
3. Step 3:
Update support estimate: Ji=Ji−1∪ti.
4. Step 4:
Update residual: ri=(In−PJi)y.
5. Step 5:
Repeat Steps 2-4, if stopping condition (SC) is not met, else, output I^=Ji.
The properties of OMP is determined by the SC. A large body of literature regarding OMP assumes a priori knowledge of sparsity level of β, i.e., k∗ and run k∗ iterations of OMP[21, 22]. When k∗ is not known, two popular SCs for OMP are discussed in literature. One SC called residual power based stopping condition (RPSC) terminate iterations when the residual power becomes too low (i.e., ∥ri∥2<σΓ4) and other SC called residual correlation based stopping condition (RCSC) terminate iterations when the maximum correlation of columns in X with the residual becomes too low (i.e., ∥XTri∥∞<σΓ5). A commonly used value of Γ4 is Γ4=n+2nlog(n) and that of Γ5 is Γ5=2(1+η)log(p)[23]. Here η>0 is a constant. The following theorems state the sufficient conditions for OMP with RPSC and RCSC to be high SNR consistent.
Theorem 10**.**
OMP with RPSC is high SNR consistent for any matrix X and signal β satisfying the ERC provided that the hyper parameter Γ4 satisfies σ2→0limΓ4=∞ and σ2→0limσΓ4=0.
Theorem 11**.**
OMP with RCSC is high SNR consistent for any matrix X and signal β satisfying the ERC provided that the hyper parameter Γ5 satisfies σ2→0limΓ5=∞ and σ2→0limσΓ5=0.
Let us consider the two processes- OMP iterating without SC (P1) and verification of the SC (P2) separately. Specifically P1 returns a set of indexes in order, say {t1,t2,…} and P2 returns a single index j indicating where to stop. Then, the support estimate is given by I^={t1,…,tj} and the indices after j, i.e., {tj+1,…} will be discarded. Let E1 denotes the event {t1,…,tk∗}=I, i.e., the first k∗ iterations of OMP returns all the k∗ indices in I and E2 denotes the event {P2 returns k∗}. Then P(I^=I)=P(E1∩E2).
Let Ni=∥XT(In−PJi−1)w∥∞ denotes the maximum correlation between the columns in X and noise component in the current residual ri−1 and βmin=j∈Imin∣βj∣ denotes the minimum non-zero value in β. Then, using the analysis in Section @slowromancapv@ of [23], Ni<cIβmin, where cI=2k∗(1−erc(X,I))λmin(XITXI) is a sufficient condition for selecting an index from I in the ith iteration (∀i≤k∗). Since, ∥Xj∥2=1, it follows that Ni≤∥(In−PJi−1)w∥2. Thus, P(E1)≥P(i=1…,k∗∩{∣∣(In−PJi−1)w∣∣2<cIβmin}). One can bound P(E1C) using union bound and the inequality ∥(In−PJi−1)w∥2≤∥w∥2 as
[TABLE]
where Z=σ2∥w∥22∼χn2. Since, Z is a B.I.P R.V with distribution independent of σ2 and σ2cI2βmin2→∞ as σ2→0, we have σ2→0limP(Z>σ2cI2βmin2)=0. This implies that σ2→0limP(E1)=1. To summarize, if erc(X,I)<1 and OMP runs exactly k∗ iterations, then the true support can be detected exactly at high SNR.
The conditional probability P(E2∣E1) is given by P(E2∣E1)=P({SC is not satisified fori=1,…,k∗−1}∩{SC is satisfied fori=k∗}∣E1). Complementing and applying union bound gives
[TABLE]
Proof of Theorem 10:- For RPSC, the SC is given by {∥ri∥2<σΓ4}. First consider P(Si)=P(∥ri∥2<σΓ4) for i<k∗ in (24). Using triangle inequality, ∥ri∥2≥∥(In−PJi)Xβ∥2−∥(In−PJi)w∥2. Conditioned on E1, we have Ji⊂I for i<k∗ and hence ∃λi>0 such that ∥(In−PJi)Xβ∥2>λi, for all σ2>0. Further, ∥(In−PJi)w∥2≤∥w∥2. Applying these bounds in P(Si)=P(∥ri∥2<σΓ4) gives
[TABLE]
Since, w∼N(0n,σ2In), we have ∥w∥2→P0 as σ2→0. By the hypothesis of Theorem 10, σ2→0limσΓ4=0. Hence, ∥w∥2+σΓ4→P0 as σ2→0. Now by the definition of C.I.P, σ2→0limP(∥w∥2+σΓ4>λi)=0.
This implies that σ2→0limP(Si)=0,∀i<k∗.
Next consider P(Sk∗) in (24). Conditioned on E1, all the first k∗ iterations of OMP are correct, i.e., Jk∗=I. This implies that ∥rk∗∥22=∥(In−PI)y∥22=∥(In−PI)w∥22∼σ2χn−k∗2. Consequently, P(Sk∗)=P(∥rk∗∥22>σ2Γ42)=P(Z>Γ42), where Z=σ2∥rk∗∥22∼χn−k∗2. Since Z is a B.I.P R.V with distribution independent of σ2 and Γ4→∞ as σ2→0, it follows that σ2→0limP(Sk∗)=0. Substituting σ2→0limP(Si)=0 for i≤k∗ in (24), we have σ2→0limP(E2∣E1)=1. Combining this with σ2→0limP(E1)=1 gives σ2→0limP(I^=I)=σ2→0limP(E1∩E2)=σ2→0limP(E1)σ2→0limP(E2∣E1)=1.∎
Proof of Theorem 11:- For RCSC, the SC is given by {∥XTri∥∞<σΓ5}. First consider P(Si)=P(∥XTri∥∞<σΓ5) for i<k∗ in (24). ∥XTri∥∞ can be lower bounded using triangle inequality as ∥XTri∥∞≥∥XT(In−PJi−1)Xβ∥∞−∥XT(In−PJi−1)w∥∞. Further, ∥Xi∥2=1 implies that ∥XT(In−PJi−1)w∥∞≤∥(In−PJi−1)w∥2≤∥w∥2.
Conditioned on E1, we have Ji⊂I for i<k∗ and hence ∃λi>0 such that ∥XT(In−PJi)Xβ∥∞>λi, for all σ2>0. Applying these bounds in P(Si)=P(∥XTri∥∞<σΓ5) gives
[TABLE]
Following the same arguments used in the proof of Theorem 10 we have σ2→0limP(Si)=0,∀i<k∗.
Next we consider P(Sk∗)=P(∥XTrk∗∥∞>σΓ5). Since, the first k∗ iterations are correct, i.e., Jk∗=I, we have ∥XTrk∗∥∞=∥XT(In−PI)y∥∞=∥XT(In−PI)w∥∞. Using Cauchy Schwartz inequality and ∥Xj∥2=1, it follows that ∥XT(In−PI)w∥∞≤∥(In−PI)w∥2. Hence, P(Sk∗)≤P(σ2∥(In−PI)w∥22>Γ52). Since, σ2∥(In−PI)w∥22∼χn−k∗2 is a B.I.P R.V and the term Γ52→∞, it follows that σ2→0limP(Sk∗)=0. Substituting σ2→0limP(Si)=0 for i≤k∗ in (24), we have σ2→0limP(E2∣E1)=1. Combining this with σ2→0limP(E1)=1 gives σ2→0limP(I^=I)=σ2→0limP(E1∩E2)=σ2→0limP(E1)σ2→0limP(E2∣E1)=1. ∎
Remark 6*.*
The following observations can be made about the convergence rates in RPSC and RCSC. The rate at which P(E1) converges to one is independent of Γ4 or Γ5. First consider P(Si) for i<k∗ and let Γ4=σα1 be the SNR adaptation. Then the rate at which P(Si) converges to zero is maximum when α=0 and decreases with increasing α. However, the rate at which P(Sk∗) converges to zero increases with increasing α.
VI Numerical Simulations
Here we numerically verify the results proved in Theorems 1-11. We consider two classes of matrices for simulations.
ERC matrix: We consider a n×2n matrix X formed by the concatenation of a n×n identity matrix and a Hadamard matrix of size n×n denoted by Hn, i.e., X=[In,Hn.]. It is well known that this matrix has mutual coherence μX=n1[Chapter 2,[39]]. We fix n as n=32 and for this value of n, X satisfy MIC for any β with sparsity k∗≤21(1+n)=3.3284. As explained in section @slowromancapii@, MIC implies ERC also.
Random matrix: A random matrix X is generated using i.i.dXi,j∼N(0,1) R.Vs and columns in this matrix are later normalised to have unit l2 norm. In each iteration the matrix X is independently generated. The matrix support pair thus generated in each iteration may or may not satisfy ERC.
All non zero entries have same magnitude (denoted by βk in figures) but random signs. Further, the k∗ non zero entries are selected randomly from the set [p]. The figures are produced after performing 105 iterations at each SNR level.
VI-A Performance of l0-penalty.
The performance of l0-penalty with different values of Γ0 is reported in Fig.1. The matrix under consideration is a 5×10 random matrix. “Known k∗” represents the performance of an oracle SSP with a priori information of k∗. This SSP estimates I^ using I^=J⊂[p],∣J∣=k∗argmin∣∣(In−PJ)y∣∣22 and will have superior performance when compared with l0-penalty which is oblivious to k∗.
L.H.S of Fig.1 gives the performance of l0-penalty with SNR independent values of Γ0 discussed in literature. AIC uses Γ0=2, BIC uses Γ0=log(n), RIC-FG uses Γ0=2log(p)[29], RIC-ZS uses Γ0=2log(p)+2log(log(p)) [30] and EBIC uses Γ0=log(n)+∣∣b∣∣02log((∣∣b∣∣0p)). As predicted by Theorem 2, l0-penalty with all these values of Γ0 are inconsistent at high SNR. The performance of RIC-ZS is the best among the values of Γ0 under consideration. The performance of BIC and AIC are much poorer compared to other schemes. When n=5, Γ0=2 in AIC is bigger than Γ0=log(n) of BIC and this explains the inferior performance of BIC viz a viz AIC. For higher values of n, BIC will perform better than AIC.
R.H.S gives the performance of l0-penalty with Γ0=f(σ2)[log(n)+∥b∥02log((∥b∥0p))], i.e., a SNR adaptation is added to EBIC penalty. ‘‘log(σ21)′′ in Fig.1 represents f(σ2)=log(σ21). This SNR adaptation satisfies the conditions in Theorem 1 and is common in popular MOS criteria like NMDL, g-MDL etc. [17]. The schemes represented using α=(.) has f(σ2)=σα1. Among the values of α considered in Fig.1, α=0.5 and α=1 satisfies the conditions in Theorem 1, α=−0.3 violates Theorem 2 and α=2.1 violates Theorem 3 respectively. As predicted by Theorems 1-3, only ‘‘log(σ21)′′, α=0.5 and α=1 that satisfies the conditions in Theorem 1 are high SNR consistent. This verify the NSCs derived in section @slowromancapiii@. Further, the performance of l0-penalty with Γ0 represented by “log(σ21)” and α=0.5 are very close to the optimal scheme represented by “Known k∗” across the entire SNR range. This suggest the finite SNR utility of the SNR adaptations suggested by Theorem 1.
VI-B Performance of l1-penalty and l1-error at high SNR.
L.H.S of Fig.2 gives the performance of l1-penalty and R.H.S of Fig.2 gives the performance of l1-error respectively. Both these SSPs are evaluated for the 32×64 ERC matrix previously defined and a 75×100 random matrix. “22log(p)” in L.H.S represents the performance of l1-penalty with Γ1=22log(p) [34] and “α=(.)“ represents l1-penalty with Γ1=σα122log(p). Similarly, in the R.H.S, “n+22n” represents the l1-error with Γ2=n+22n [36] and “α=(.)” represents l1-error with Γ2=σα1n+22n. In both cases, α=(.) incorporates a SNR adaptation into a well known value of Γ1 and Γ2. By Theorems 4-8, these SNR adaptations are consistent iff 0<α<1.
First we consider the performance of l1-penalty for the matrix X satisfying ERC. It is clear from Fig.2 that l1-penalty with Γ1=22log(p) floors at high SNR with a PE≈10−2.5. Hence, l1-penalty with Γ1=22log(p) is inconsistent at high SNR and this validates Theorem 5. Other σ2 independent values of Γ1 discussed in Section @slowromancapiv@ also floors at high SNR. On the contrary, l1-penalty with SNR dependent Γ1 does not floor at high SNR and this validates Theorem 4. Further, Γ1 with α=0.1 performs better than Γ1=22log(p) even for σ2≈0.01. In the same setting, l1-error with Γ2=n+22n is inconsistent at high SNR. In fact PE for Γ2=n+22n floors at PE≈10−1.25 at high SNR. It is evident from Fig.2 that Γ2=σα1n+22n, where α=0.15 and α=0.3 are high SNR consistent. These results validates Theorems 7-8. In fact l1-error with α=0.15 and α=0.3 performs much better than the SNR independent Γ2=n+22n from σ2≈0.01 onwards.
Next we consider the performance of l1-penalty and l1-error when X is a random 75×100 matrix. Here also l1-penalty and l1-error with values of Γ1 and Γ2 independent of σ2 floors at high SNR. However, unlike the case of ERC matrix, Γ1 and Γ2 with SNR adaptations stipulated by Theorem 4 and Theorem 7 appears to floor at high SNR. This is because of the fact that there is a non zero probability perc>0 with which a particular realization of (X,β) pair fails to satisfy conditions like ERC. In fact perc decreases exponentially with increasing n. Hence, for random matrices perc dictates the PE at which l1-penalty and l1-error floors. Note that the level at which PE of l1-penalty with Γ1 and Γ2 satisfying the SNR adaptations stipulated by Theorem 4 and Theorem 7 floors is significantly lower than the case with SNR independent Γ1 and Γ2. This indicates that the proposed SNR adaptations can improve performance in situations beyond the regression classes for which high SNR consistency is established.
VI-C Performance of OMP at high SNR.
L.H.S of Fig.3 presents the performance of OMP with RPSC and R.H.S presents the performance of OMP with RCSC respectively. Both these SSPs are evaluated for the ERC matrix previously defined and a 75×100 random matrix. “Known k∗” represents a hypothetical SSP which runs OMP for exactly k∗=3 iterations. “fn” in the L.H.S represents the performance of RPSC with Γ4=n+2nlog(n) and “α=(.)” represents the performance of RPSC with Γ4=σα1n+2nlog(n). Similarly, “fp” in the R.H.S represents the performance of RCSC with Γ5=4log(p) and “α=(.)” represents the performance of RCSC with Γ5=σα14log(p). Γ4=n+2nlog(n) and Γ5=4log(p) are suggested in [23]. “α=(.)” in both cases incorporate a SNR adaptation into these well known stopping parameters. It is clear from the Fig.3 that OMP with SC independent of σ2 floors at high SNR for both ERC and random matrices, whereas, the flooring of PE is not present in OMP with SC satisfying Theorems 10 and 11 for ERC matrix. For random matrix, the performance of OMP with proposed SNR adaptations floors at a PE level equal to that of OMP with known k∗. This flooring is also due to the causes explained for l1-penalty and l1-error.
VI-D * On the choice of SNR adaptations.*
Fig.4 presents the effect of signal strength ∣βj∣ and SNR adaptations on the convergence rates of l1-penalty and OMP-RPSC. “fn” represents RPSC with Γ4=n+22log(n) as before. “α=(.)” represents l1-penalty with Γ1=σα122log(p) and RPSC with Γ4=σα1n+22log(n). By Theorems 4 and 10, the SNR adaptations represented by α=(.) will be consistent for both l1-penalty and RPSC iff 0<α<1. However, the deviations from the base tuning parameters (i.e., 22log(p) and n+22log(n)) will be more pronounced as α increases. This will influence the rate at which PE converges to zero.
At very high SNR, the performance of l1-penalty and OMP-RPSC with larger values of α will be better. This is true for both low and high values of regression coefficients (i.e., βj=0.5 and βj=3). Throughout the moderate to high SNR range, the performance of these algorithms with high values of α will be poor in comparison with the base tuning parameter when ∣βj∣ is low. In the same SNR and signal strength regime the performance with low values of α will be better than both base tuning parameter and high value of α. As the signal strength improves, the performance of these algorithms improves for all values of α. However, the performance with high values of α will be much better than the performance with low values of α when ∣βj∣ is high. Note that the PE with base tuning parameter floors at the same value irrespective of signal strength. The numerical results are in line with the inferences derived from the convergence rate analysis of l1-penalty. Similar inferences can be derived from the numerical experiments (not shown) conducted for other CS algorithms considered in this paper.
Note that the very high SNR regime is rarely encountered in practice. Further, a low value of α will provide a performance atleast as good as the performance of the base parameter in the moderate SNR range irrespective of the signal strength and a progressively improving performance as the SNR or the signal strength improves. Hence, by following the philosophy of minimizing the worst case risk, it will be advisable to choose smaller values of α like α=0.1 for practical applications.
VII Conclusion
NSCs for the high SNR consistency of CS algorithms like l0-penalty, l1-penalty, l1-error, DS and OMP are derived in this paper. Aforementioned algorithms with the tuning parameters discussed in literature are analytically and numerically shown to be inconsistent at high SNR. Novel tuning parameters for these CS algorithms are derived based on the sufficient conditions and justified using convergence rate analysis. CS algorithms with the proposed tuning parameters are numerically shown to perform better than existing tuning parameters.
Bibliography39
The reference list from the paper itself. Each links out to its DOI / PubMed record.
1[1] Y. C. Eldar and G. Kutyniok, Compressed sensing: Theory and applications . Cambridge University Press, 2012.
2[2] T. T. Emmanuel Candes, “The Dantzig selector: Statistical estimation when p is much larger than n,” Ann. Stat. , vol. 35, no. 6, pp. 2313–2351, 2007.
3[3] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Inf. Theory , vol. 50, no. 10, pp. 2231–2242, 2004.
4[4] J. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Trans. Inf. Theory , vol. 52, no. 3, pp. 1030–1051, March 2006.
5[5] D. P. Wipf and B. D. Rao, “Sparse Bayesian learning for basis selection,” IEEE Trans. Signal Process. , vol. 52, no. 8, pp. 2153–2164, 2004.
6[6] M. Masood and T. Y. Al-Naffouri, “Sparse reconstruction using distribution agnostic Bayesian matching pursuit,” IEEE Trans. Signal Process. , vol. 61, no. 21, pp. 5298–5309, 2013.
7[7] Z. Ben-Haim, Y. C. Eldar, and M. Elad, “Coherence-based performance guarantees for estimating a sparse vector under random noise,” IEEE Trans. Signal Process. , vol. 58, no. 10, pp. 5030–5043, 2010.
8[8] K. Sreejith and S. Kalyani, “High SNR consistent thresholding for variable selection,” IEEE Signal Process. Lett. , vol. 22, no. 11, pp. 1940–1944, Nov 2015.