Computational approaches to non-convex, sparsity-inducing multi-penalty regularization
Zeljko Kereta, Johannes Maly, and Valeriya Naumova

TL;DR
This paper investigates efficient algorithms for non-convex multi-penalty regularization in sparse signal reconstruction, introducing a new infimal convolution approach with proven linear convergence and validated by numerical experiments.
Contribution
It extends existing methods to non-convex settings, proposes a computationally efficient infimal convolution approach, and provides convergence analysis with numerical validation.
Findings
Both approaches achieve linear convergence rates.
The infimal convolution method is less dependent on problem size.
Numerical experiments confirm theoretical convergence rates.
Abstract
In this work we consider numerical efficiency and convergence rates for solvers of non-convex multi-penalty formulations when reconstructing sparse signals from noisy linear measurements. We extend an existing approach, based on reduction to an augmented single-penalty formulation, to the non-convex setting and discuss its computational intractability in large-scale applications. To circumvent this limitation, we propose an alternative single-penalty reduction based on infimal convolution that shares the benefits of the augmented approach but is computationally less dependent on the problem size. We provide linear convergence rates for both approaches, and their dependence on design parameters. Numerical experiments substantiate our theoretical findings.
Željko Kereta, University College London, United Kingdom
Simula Research Laboratory, Simula Metropolitan Center for Digital Engineering, Norway
Johannes Maly, KU Eichstaett/Ingolstadt, Germany
Valeriya Naumova, Simula Research Laboratory, Simula Metropolitan Center for Digital Engineering, Norway
1 Introduction
In many real-life applications one is interested in recovering a structured signal from few corrupted linear measurements. One particular challenge lies in separating the ground-truth from pre-measurement noise since any such corruption is amplified during the measurement process, a phenomenon known as noise folding [2] or input noise model [1]. It commonly appears in signal processing and compressed sensing applications, where noise is added to the signal both before and after the measurement process occurs. This can be modeled as
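$$\mathbf{y} = \mathbf{A}(\mathbf{x} + \mathbf{v}) + \boldsymbol{\xi}, \tag{1}$$
where $\mathbf{x} \in \mathbb{R}^n$ is an $s$-sparse original signal that we want to recover, $\mathbf{v} \in \mathbb{R}^n$ is the pre-measurement noise, $\boldsymbol{\xi} \in \mathbb{R}^m$ is the post-measurement noise, and $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the measurement matrix. Note that a signal is called $s$-sparse if its support consists of at most $s$ elements, i.e. $|\operatorname{supp}(\mathbf{x})| \le s$. Information theoretic bounds state that the number of measurements $m$ required for the exact support recovery of $\mathbf{x}$ from (1) needs to scale linearly with $n$ (see Footnote 1), which leads to poor compression performance [1].
Footnote 1: Assume for simplicity $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma_v^2 \mathbf{I}_n)$, and $\boldsymbol{\xi} \sim \mathcal{N}(\mathbf{0}, \sigma_\xi^2 \mathbf{I}_m)$. We now write (1) as $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$, where $\mathbf{w} = \mathbf{A}\mathbf{v} + \boldsymbol{\xi}$ represents the effective noise. The covariance matrix of $\mathbf{w}$ equals $\sigma_v^2 \mathbf{A}\mathbf{A}^\top + \sigma_\xi^2 \mathbf{I}_m$. Assuming $\mathbf{A}\mathbf{A}^\top \approx \frac{n}{m}\mathbf{I}_m$ (as is the case, with high probability, for $\mathbf{A}$ with zero mean, $1/m$-variance sub-Gaussian entries), and $\sigma_v = \sigma_\xi = \sigma$, we would have $\operatorname{Cov}(\mathbf{w}) \approx \gamma \sigma^2 \mathbf{I}_m$, for $\gamma = 1 + n/m$. Thus, the variance of the noise rises by a factor proportional to $n/m$, which when $m \ll n$ can be substantial.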
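The noise amplification can be checked numerically. The following is a minimal sketch, assuming the model $\mathbf{y} = \mathbf{A}(\mathbf{x}+\mathbf{v}) + \boldsymbol{\xi}$ with independent Gaussian noise terms of a common standard deviation and a Gaussian matrix with entry variance $1/m$; the function name `effective_noise_variance`, the sizes, and the seed are illustrative choices and not taken from the paper.

```python
import numpy as np

def effective_noise_variance(m, n, sigma, seed=0):
    """Average per-entry variance of the folded noise w = A v + xi.

    Assumes v ~ N(0, sigma^2 I_n) and xi ~ N(0, sigma^2 I_m) are independent
    and A has i.i.d. N(0, 1/m) entries, so that Cov(w) = sigma^2 (A A^T + I_m).
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    cov_w = sigma**2 * (A @ A.T + np.eye(m))
    return np.trace(cov_w) / m

m, n, sigma = 50, 500, 0.1
observed = effective_noise_variance(m, n, sigma)
predicted = sigma**2 * (1 + n / m)  # heuristic value sigma^2 (1 + n/m)
```

For strongly undersampled problems (small `m` relative to `n`) the two values should agree closely, and both exceed the post-measurement variance `sigma**2` by roughly the factor `n / m`.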
A number of recent studies [3, 21, 16, 15] try to mitigate these issues through a multi-penalty regularization framework defined as
[TABLE]
where are regularization parameters, , and . In particular, to promote sparsity of the component we choose . A natural way to minimize (2) is via alternating minimization, starting from and then iterating as
[TABLE]
Whereas the second problem is differentiable and admits an explicit solution, the first problem requires iterative thresholding for [21], for each outer iteration , and becomes non-convex if . Moreover, alternating minimization does not lend itself to an easy analysis of the convergence rate.
1.1 Contribution
In this work we examine the multi-penalty problem (2), for the case and . We first show that the augmented approach in [16], which allows one to decouple the computation of and components of the solution, can be easily extended to to obtain an augmented single-penalty iterative thresholding algorithm providing solutions to (2). Since this includes computing the inverse of a possibly high-dimensional matrix, we suggest an alternative single-penalty iterative thresholding algorithm which is based on an infimal convolution formulation of (2) and sidesteps the computational bottleneck of the augmented approach. We show a linear convergence rate for both approaches, as a function of the design parameters, and confirm both the rate analysis and the efficiency gap in numerical simulations. In particular, we argue that the benefits of faster convergence rates are sometimes offset by the computational demands, which suggests that the preferred method for solving the optimization problem should be chosen with respect to the size of .
1.2 Related Work
In [21] the authors approach (3), for and , on separable Hilbert spaces by applying iterative thresholding algorithms to each of the sub-problems, and show convergence of the sequence of iterates to stationary points of the underlying problem. The choice is of special interest when models uniform pre-measurement noise. However, the authors also show that exhibits the best (empirical) performance for the reconstruction of , for modelling various common noise types (including uniform noise). It is for this reason that in this paper we are concerned only with the case . We add though that more general noise types might be of interest in very particular cases, and this is a possible topic for future research. In [16] the authors reduce the optimization problem (2) to a single-penalty regularization through an augmented data matrix, for and , and derive conditions on optimal support recovery. The authors provide theoretical and numerical evidence of superior performance of multi-penalty regularization over standard single-penalty approaches for the sparse recovery of solutions to (1). In [15] a principled, data-driven parameter selection approach is derived for and , based on the Lasso path. Instead of through noise folding, a multi-penalty formulation of the objective function can also be seen from the perspective of the recovery of a signal that is a superposition of two components, e.g. a sparse and a smooth component. See [12] and references therein. In spite of these and other advances, rigorous results regarding convergence rate and error analysis for (2) have not been established.
Since we reduce (2) to specific single-penalty problems, corresponding convergence results on classical proximal descent methods are of interest. In [9] important insights on support stability and convergence of iterative thresholding algorithms on separable Hilbert spaces have been collected, while [28] proved linear convergence rates of the iterative thresholding algorithm, under certain conditions, if the underlying thresholding operator is not continuous, though the dependency on the parameters of the optimization scheme is not explicitly derived. Linear convergence of a single-penalty non-convex regularizer with adaptive thresholding was established in [24], where the influence of the RIP of the design matrix on the convergence constant can be inferred. A further survey of non-convex regularizers for sparse recovery can be found in [25].
Lastly, approaches representing regularizers as infimal convolution can be found in the context of machine learning and signal processing, cf. [17, 18]. Therein primal-dual schemes are examined for optimizing functionals penalized via infimal convolutions. The results, however, require piece-wise convexity which is not given in our case.
1.3 Notation
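We restrict boldface lettering to matrices (uppercase), e.g. $\mathbf{A}$, and vectors (lowercase), e.g. $\mathbf{x}$. The $i$-th entry of a vector $\mathbf{x} \in \mathbb{R}^n$ is denoted as $x_i$. For $n \in \mathbb{N}$ we denote $[n] = \{1, \dots, n\}$. For $p > 0$ the $\ell_p$-norm of a vector $\mathbf{x}$ is denoted by $\|\mathbf{x}\|_p$. The support set of $\mathbf{x}$ is denoted as
$$\operatorname{supp}(\mathbf{x}) = \{ i \in [n] : x_i \neq 0 \},$$
and the sign is defined component-wise by
$$\operatorname{sign}(\mathbf{x})_i = \begin{cases} 1 & \text{if } x_i > 0, \\ 0 & \text{if } x_i = 0, \\ -1 & \text{if } x_i < 0. \end{cases}$$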
For a matrix , we use to denote its spectral norm and to denote its smallest singular value. We denote the identity matrix by . For , represents the submatrix of containing the columns indexed by , and denotes the subvector of containing the entries restricted to . We denote the corresponding orthogonal projection operator onto as , so that . When indexed by a set , denotes the orthogonal projection onto . Finally, the set-valued operator denotes the limiting Fréchet subdifferential, and is its corresponding domain when applied to a function , cf. [23, 20].
2 Main Results
Consider the multi-penalty problem (2) for , i.e. minimizing
[TABLE]
and denote a corresponding solution pair by
[TABLE]
As mentioned above , , are regularization parameters balancing the contributions of the data-fidelity term and the two regularization terms, and .
Let us introduce two widely known concepts relevant for the forthcoming discussion. The first is the Kurdyka-Łojasiewicz (KŁ) property, a well-established tool for analyzing the convergence, and convergence rates, of proximal descent algorithms [4].
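Definition 2.1.
A function $f \colon \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is said to have the KŁ property at $\bar{\mathbf{x}} \in \operatorname{dom} \partial f$ if there exists $\eta \in (0, \infty]$, a neighbourhood $U$ of $\bar{\mathbf{x}}$, and a continuous concave function $\varphi \colon [0, \eta) \to [0, \infty)$ such that
1. $\varphi(0) = 0$, and $\varphi$ is continuously differentiable on $(0, \eta)$ with $\varphi'(t) > 0$ for all $t \in (0, \eta)$.
2. For all $\mathbf{x} \in U \cap \{ \mathbf{x} : f(\bar{\mathbf{x}}) < f(\mathbf{x}) < f(\bar{\mathbf{x}}) + \eta \}$ the KŁ inequality holds
$$\varphi'\big(f(\mathbf{x}) - f(\bar{\mathbf{x}})\big) \operatorname{dist}\big(\mathbf{0}, \partial f(\mathbf{x})\big) \ge 1.$$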
The KŁ property is used to describe the speed of convergence through the desingularizing function . It has been shown that semi-algebraic functions satisfy the KŁ property with , where and is called the KŁ constant, which characterizes the convergence speed of proximal gradient descent algorithms [4, Theorem 11]. As observed in [8], Corollary 3.6 in [19] may be used to determine the KŁ constant of piecewise convex polynomials. Even though has the KŁ property, cf. [5, Example 5.4], it does not result in piece-wise convex polynomials for , and thus we cannot apply [19, Corollary 3.6] to infer the speed of convergence. We will instead adopt and adapt the ideas from [9, 28].
The second concept relevant for this paper is the restricted isometry property (RIP), which allows one to control the eigenvalues of small submatrices of , and to characterize measurement operators that allow stable and robust reconstruction of sparse signals from measurements.
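Definition 2.2.
A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ satisfies the restricted isometry property of order $s$ ($s$-RIP) with constant $\delta_s \in (0,1)$, if for all $s$-sparse $\mathbf{x} \in \mathbb{R}^n$
$$(1 - \delta_s) \|\mathbf{x}\|_2^2 \le \|\mathbf{A}\mathbf{x}\|_2^2 \le (1 + \delta_s) \|\mathbf{x}\|_2^2.$$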
Remark 2.3.
For a detailed treatment of RIP, and measurement operators that fulfill it, we refer the reader to [14]. Let us only mention that if the entries of $\mathbf{A} \in \mathbb{R}^{m \times n}$ are i.i.d. copies of a Gaussian random variable with mean zero and variance $1/m$, then
$$m \ge C \, \delta^{-2} \, s \log(en/s)$$
measurements suffice to have an $s$-RIP with constant $\delta_s \le \delta$ with high probability, for an absolute constant $C > 0$. Consequently, $\delta_s = \mathcal{O}\big(m^{-1/2}\sqrt{s \log(en/s)}\big)$ with high probability.
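The dependence of the RIP constant on the number of measurements can be illustrated empirically. Computing the constant exactly is combinatorial, so the sketch below only samples random supports and therefore yields a lower bound; it assumes Gaussian entries with variance $1/m$, and the helper name `rip_lower_bound`, the sizes, and the number of trials are hypothetical choices made for illustration.

```python
import numpy as np

def rip_lower_bound(A, s, trials=200, seed=0):
    """Monte Carlo lower bound on the s-RIP constant of A.

    Samples random supports S of size s and records the largest deviation of
    the squared extreme singular values of A[:, S] from 1. The true constant
    is a maximum over all supports, so this can only underestimate it.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    worst = 0.0
    for _ in range(trials):
        S = rng.choice(n, size=s, replace=False)
        sv = np.linalg.svd(A[:, S], compute_uv=False)
        worst = max(worst, abs(sv[0] ** 2 - 1.0), abs(sv[-1] ** 2 - 1.0))
    return worst

rng = np.random.default_rng(1)
n, s = 400, 5
estimates = {}
for m in (50, 800):
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    estimates[m] = rip_lower_bound(A, s)
```

The estimate for the larger number of measurements should be markedly smaller, in line with the scaling stated in the remark.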
2.1 Augmented Formulation
It was observed in [16] that for , the multi-penalty problem (2) reduces to single-penalty regularization where measurement matrix and datum are adjusted by the regularization parameter . We include this result, extended to , together with the proof (see Section A.1), which is analogous to [16, Lemma 1].
Lemma 2.4.
The pair minimizes in (4) if and only if
[TABLE]
and is the solution of the augmented problem
[TABLE]
with
[TABLE]
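The reduction to an augmented single-penalty problem can be checked numerically. The sketch below assumes the functional has the form $\|\mathbf{A}(\mathbf{u}+\mathbf{v})-\mathbf{y}\|_2^2 + \alpha\|\mathbf{u}\|_p^p + \beta\|\mathbf{v}\|_2^2$ and that the augmented quantities are $\mathbf{M}^{-1/2}\mathbf{A}$ and $\mathbf{M}^{-1/2}\mathbf{y}$ with $\mathbf{M} = \mathbf{I} + \mathbf{A}\mathbf{A}^\top/\beta$; the scaling of the fidelity term and the names `augmented_system` and `optimal_v` are assumptions made here, not notation from the paper.

```python
import numpy as np

def augmented_system(A, y, beta):
    """Return (B, y_aug) with B = M^{-1/2} A and y_aug = M^{-1/2} y, M = I + A A^T / beta."""
    m = A.shape[0]
    M = np.eye(m) + (A @ A.T) / beta
    w, Q = np.linalg.eigh(M)               # M is symmetric positive definite
    M_inv_sqrt = (Q / np.sqrt(w)) @ Q.T    # Q diag(w^{-1/2}) Q^T
    return M_inv_sqrt @ A, M_inv_sqrt @ y

def optimal_v(A, y, u, beta):
    """Minimizer over v of ||A(u+v) - y||^2 + beta ||v||^2 for fixed u (Tikhonov step)."""
    n = A.shape[1]
    return np.linalg.solve(beta * np.eye(n) + A.T @ A, A.T @ (y - A @ u))

rng = np.random.default_rng(0)
m, n, beta = 20, 60, 0.5
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = rng.normal(size=m)
u = rng.normal(size=n)

v = optimal_v(A, y, u, beta)
B, y_aug = augmented_system(A, y, beta)
# Smooth part of the multi-penalty functional at (u, v(u)) ...
joint_value = np.sum((A @ (u + v) - y) ** 2) + beta * np.sum(v ** 2)
# ... and the fidelity term of the augmented single-penalty problem at u.
augmented_value = np.sum((B @ u - y_aug) ** 2)
```

Under these assumptions the two values coincide up to rounding for any `u`, which is why the penalty on the sparse component can be handled by a single-penalty solver. The eigendecomposition of the $m \times m$ matrix is the step that becomes expensive for large `m`.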
Remark 2.5.
The noise folding forward model (1) is in [2] written in the whitened form as , for , , , for and is a constant. Notice that this is particularly related to the augmented problem in (7).
On an unrelated note, improving on the analysis in [2, Proposition 2] one can show (see Lemma B.1) that the coherence, defined for a matrix as
[TABLE]
where is the -th column of , of the augmented measurement matrix satisfies
[TABLE]
In compressed sensing literature, the magnitude of the coherence of a matrix is an important measure of quality for measurement matrices, cf. [14, Section 5]. The bound in (8) thus suggests that for small or large , the linear measurement process modelled by is as information preserving as the one modelled by . In addition, Lemma B.2 shows that behaves like the coherence of a conditioned version of if . Let us mention that in practice behaves well for all ’s, and even moderate values of .
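As a small illustration of the coherence comparison, the sketch below evaluates the coherence (largest normalized inner product between distinct columns) of a random matrix and of an augmented matrix for a few values of $\beta$. The form $(\mathbf{I} + \mathbf{A}\mathbf{A}^\top/\beta)^{-1/2}\mathbf{A}$ of the augmented matrix is an assumption, and the names `coherence` and `augmented_matrix` are illustrative.

```python
import numpy as np

def coherence(A):
    """Largest absolute normalized inner product between distinct columns of A."""
    G = A / np.linalg.norm(A, axis=0)
    gram = np.abs(G.T @ G)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

def augmented_matrix(A, beta):
    """(I + A A^T / beta)^{-1/2} A, the assumed form of the augmented matrix."""
    w, Q = np.linalg.eigh(np.eye(A.shape[0]) + (A @ A.T) / beta)
    return (Q / np.sqrt(w)) @ Q.T @ A

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 120)) / np.sqrt(40)
mu_A = coherence(A)
mu_B = {beta: coherence(augmented_matrix(A, beta)) for beta in (0.1, 1.0, 1e6)}
```

For very large `beta` the augmented matrix is close to `A`, so the two coherence values nearly agree; the values for moderate `beta` can be inspected in `mu_B`.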
By Lemma 2.4, to estimate the solution pair it is sufficient to first solve (7), and then insert the computed solution into (6). Since the fidelity term is smooth and the regularization term non-convex, the common approach is to use iterative thresholding through a forward-backward splitting algorithm [9, 4]. For and the augmented problem (7), the resulting thresholding iterations applied are readily written as
[TABLE]
Each iteration in (9) can be viewed as a thresholded Landweber iteration; we first perform a step in the direction of the negative gradient of the data fidelity term, and then apply the proximal operator of the remaining non-convex term.
The proximal operator of a function is defined by
[TABLE]
where . For separable mappings (10) can be applied component-wise, and we have In the general case, the proximal operator (10) could be set-valued, since there might be multiple or even no minima. It can be shown though that for the (one-dimensional) proximal operator of satisfies
[TABLE]
The range of is where , see [9, Lemma 5.1], and it is discontinuous with a jump discontinuity (see Footnote 2) at . Note that the proximal operators in (11) are indeed thresholding operators, and as $p$ goes from $0$ to $1$ they interpolate between hard- and soft-thresholding operators. Moreover, a closed form of the operator is known only in special cases, namely for and [26].
Footnote 2: While the actual proximal operator of is set-valued and simultaneously assumes both possible values at , we follow common practice and restrict the operator to zero at to have a single-valued function.
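The thresholded Landweber iteration and the scalar proximal map described above can be sketched as follows. This is a minimal illustration, assuming the objective $\|\mathbf{B}\mathbf{u}-\mathbf{y}\|_2^2 + \alpha\|\mathbf{u}\|_p^p$ with $0<p<1$ and the proximal map defined through $\tfrac12 (t-x)^2 + \lambda |t|^p$; these scalings, the bisection solver, and the names `prox_lp` and `iterative_thresholding` are assumptions made here and need not match (9)-(11).

```python
import numpy as np

def prox_lp(x, lam, p, iters=80):
    """Entry-wise minimizer of 0.5*(t - x)^2 + lam*|t|^p for 0 < p < 1.

    Below the threshold tau the minimizer is 0; above it, it is the larger
    root of t + lam*p*t^(p-1) = |x|, found by bisection on [t_jump, |x|].
    """
    x = np.asarray(x, dtype=float)
    t_jump = (2.0 * lam * (1.0 - p)) ** (1.0 / (2.0 - p))  # smallest nonzero output
    tau = t_jump * (2.0 - p) / (2.0 * (1.0 - p))           # threshold on |x|
    a = np.abs(x)
    lo = np.full_like(a, t_jump)
    hi = np.maximum(a, t_jump)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        below = mid + lam * p * mid ** (p - 1.0) < a       # root lies to the right
        lo = np.where(below, mid, lo)
        hi = np.where(below, hi, mid)
    return np.where(a > tau, np.sign(x) * 0.5 * (lo + hi), 0.0)

def iterative_thresholding(B, y, alpha, p, steps=200):
    """Forward-backward iterations for ||B u - y||^2 + alpha*||u||_p^p."""
    mu = 0.45 / np.linalg.norm(B, 2) ** 2   # below 1/L with L = 2*||B||^2
    u = np.zeros(B.shape[1])
    history = []
    for _ in range(steps):
        grad = 2.0 * B.T @ (B @ u - y)
        u = prox_lp(u - mu * grad, mu * alpha, p)          # gradient step, then threshold
        history.append(np.sum((B @ u - y) ** 2) + alpha * np.sum(np.abs(u) ** p))
    return u, history

rng = np.random.default_rng(0)
B = rng.normal(size=(40, 100)) / np.sqrt(40)
u_true = np.zeros(100)
u_true[:5] = 3.0
y = B @ u_true + 0.01 * rng.normal(size=40)
u_hat, history = iterative_thresholding(B, y, alpha=0.05, p=0.5)
```

With this scaling and `lam = 1`, `p = 0.5`, the threshold is `1.5` and the smallest nonzero output has magnitude `1`, which makes the jump discontinuity visible. The recorded objective values in `history` do not increase, since each step minimizes a separable majorizer exactly.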
It follows easily that if the step-size is small enough (smaller than ), the difference of iterates in (9) decreases, i.e. as , see [9, Proposition 2.1]. Note that the iterations in (9) are quite different from those given by alternating minimization, where for each we need to compute through iterative thresholding. The following lemma makes this more precise; it shows that (9) is equivalent to performing only the first step of iterative thresholding when computing in (3). The proof can be found in Section A.2.
Lemma 2.6.
The iterations defined in (9) can be rewritten as
[TABLE]
which corresponds to a single proximal gradient descent step of (3) starting at .
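The equivalence stated in the lemma rests on the fact that the gradient direction of the augmented fidelity term equals the gradient direction of the original fidelity term once the optimal second component is inserted. The following check is a sketch under the assumption that the augmented system is built from $\mathbf{M} = \mathbf{I} + \mathbf{A}\mathbf{A}^\top/\beta$ and that the second component is the Tikhonov solution for the current iterate; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, beta = 15, 40, 0.7
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = rng.normal(size=m)
u = rng.normal(size=n)

# Direction from the augmented problem: B^T (y_aug - B u) = A^T M^{-1} (y - A u),
# with M = I + A A^T / beta (assumed form of the augmented system).
M = np.eye(m) + (A @ A.T) / beta
augmented_direction = A.T @ np.linalg.solve(M, y - A @ u)

# Direction from the u-subproblem after inserting the optimal v for this u.
v = np.linalg.solve(beta * np.eye(n) + A.T @ A, A.T @ (y - A @ u))
alternating_direction = A.T @ (y - A @ (u + v))

gap = np.max(np.abs(augmented_direction - alternating_direction))
```

The quantity `gap` vanishes up to rounding, which is the Woodbury identity in disguise.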
2.1.1 Linear Convergence
We now show that the iterates in (9) converge at a linear rate to stationary points of , i.e. points such that , and characterize the convergence constant in terms of the design parameters. Let us emphasize that since our analysis is tailored to -regularization we derive more explicit guarantees (in terms of the involved parameters) than what would follow by directly applying the more general statements of [28] to the augmented formulation (7). The proof can be found in Section A.3.
Theorem 2.7.
Let and . Assume the matrix has RIP of order with a constant , and let the stepsize satisfy . Moreover, assume (see Footnote 3) is such that and the iterates (9) satisfy . Define and . Then there exists such that for all we have
[TABLE]
Footnote 3: The sequence converges provably to a stationary point since is among other things coercive and has the KŁ property, cf. [5, Theorem 5.1]. The assumption thus is not about whether converges but about the specific limit point which mainly depends on the concrete choice of initialization.
Remark 2.8.
(i) To have linear convergence in Theorem 2.7, we have to choose such that
[TABLE]
This resembles basic assumptions of the main result in [28]. One should thus interpret Theorem 2.7 as an additional refinement, better capable of predicting numerical behavior.
(ii) Theorem 2.7 suggests that the convergence constant depends on the sparsity of the signal and properties of . Namely, if the signal is sparser (and thus smaller) then the convergence constant decreases. Similarly, the constant decreases if we increase the number of measurements.
(iii) Assuming , for , it is straight-forward to check that the rate in Theorem 2.7 becomes minimal by choosing . In this case the result transforms into
[TABLE]
(iv) Since and control the strength of regularization in , their choice depends on the expected noise level. Consequently, when setting and one needs to make a trade-off between their regularizing effect and the desired convergence speed.
2.1.2 Computational Complexity
Once has been computed, executing (9) for a constant number of iterations costs operations: for matrix-vector products and for evaluating the proximal operator. This cost is dominated by the operations needed to obtain , which involve a matrix square root and a matrix-matrix linear system and have to be done in advance. This turns out to be a computational bottleneck as soon as , since it requires operations, where depends on the algorithm used [11]. Such a computational cost can be prohibitive for high-dimensional applications.
2.2 Infimal Convolution Formulation
To overcome the computational limitations observed above, we consider an alternative approach. Define a new program by
[TABLE]
where the infimal convolution is given by
[TABLE]
For a detailed treatment of infimal convolution and its properties, see [6]. It is straight-forward to check that an equivalence between minimizing (4) and (13) holds.
Lemma 2.9.
The pair minimizes in (4) if and only if solves (13) while attains the infimal value of .
In order to solve (13) via iterative thresholding (i.e. proximal gradient descent), we need to efficiently evaluate the proximal operator of (14). A helpful observation is that (14) can be interpreted as the Moreau-envelope of , which for a function and is defined as
[TABLE]
where the last equality only holds if . It has been observed in [7, Theorem 6.63] that computing the proximal operator of the Moreau envelope reduces to computing the proximal operator of the underlying function. Though stated only for convex functions in [7], it is straight-forward to generalize the result.
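The reduction of the proximal operator of a Moreau envelope to that of the underlying function can be verified by brute force, also for a non-convex function. The sketch below assumes the envelope $M(z) = \min_u f(u) + \frac{1}{2\mu}(z-u)^2$ and the identity $\operatorname{prox}_{\lambda M}(x) = x + \frac{\lambda}{\lambda+\mu}\big(\operatorname{prox}_{(\lambda+\mu) f}(x) - x\big)$, which is the form of [7, Theorem 6.63]; the grid, the choice $f(t) = |t|^{1/2}$, and the parameter values are illustrative assumptions.

```python
import numpy as np

def f(t, p=0.5):
    """Non-convex penalty |t|^p used as the underlying function."""
    return np.abs(t) ** p

grid = np.linspace(-1.0, 5.0, 1201)   # spacing 0.005
lam, mu, x = 0.5, 0.5, 3.0

# Moreau envelope M(z) = min_u f(u) + (z - u)^2 / (2 mu), evaluated on the grid.
envelope = np.min(
    f(grid)[None, :] + (grid[:, None] - grid[None, :]) ** 2 / (2.0 * mu), axis=1
)

# Left-hand side: prox of lam * M at x by direct search over z.
z_direct = grid[np.argmin(lam * envelope + 0.5 * (grid - x) ** 2)]

# Right-hand side: convex combination of x and the prox of (lam + mu) * f at x.
u_prox = grid[np.argmin((lam + mu) * f(grid) + 0.5 * (grid - x) ** 2)]
z_formula = x + lam / (lam + mu) * (u_prox - x)
```

Both values agree up to the grid resolution. The only extra work compared with the plain proximal map is the final convex combination.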
Lemma 2.10.
Let be a lower semi-continuous function with . Then,
[TABLE]
The proof is in Section A.4. Define now the proximal gradient descent for (13) by
[TABLE]
We denote by the sequence of minimizers attaining , and set . Note that with this notation and can also be characterized via
[TABLE]
Unlike (15), the representation in (16) does not yield a practically viable algorithm, since and are not decoupled. It does though lend itself to theoretical analysis of the iterations, cf. Section A.5.
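A proximal gradient scheme on the infimal convolution formulation can be sketched as follows. To keep the example short, soft thresholding (the case $p = 1$) stands in for the $\ell_p$ proximal map; this substitution, the functional $\|\mathbf{A}\mathbf{z}-\mathbf{y}\|_2^2 + \sum_i \min_w \big(\alpha|w| + \beta (z_i - w)^2\big)$, and the names `soft`, `envelope`, and `infconv_descent` are assumptions made for illustration and are not the paper's iteration (15) verbatim.

```python
import numpy as np

def soft(x, t):
    """Soft thresholding, the proximal map of t*|.| (stand-in for the l_p case)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def envelope(z, alpha, beta):
    """Sum over entries of min_w alpha*|w| + beta*(z_i - w)^2."""
    w = soft(z, alpha / (2.0 * beta))
    return alpha * np.sum(np.abs(w)) + beta * np.sum((z - w) ** 2)

def infconv_descent(A, y, alpha, beta, steps=300):
    step = 0.45 / np.linalg.norm(A, 2) ** 2    # below 1/L with L = 2*||A||^2
    mu_env = 1.0 / (2.0 * beta)                # Moreau parameter of the envelope
    z = np.zeros(A.shape[1])
    history = []
    for _ in range(steps):
        x = z - step * (2.0 * A.T @ (A @ z - y))         # gradient step on the fidelity
        inner = soft(x, (step + mu_env) * alpha)         # prox of the underlying penalty
        z = x + step / (step + mu_env) * (inner - x)     # convex combination
        history.append(np.sum((A @ z - y) ** 2) + envelope(z, alpha, beta))
    u = soft(z, alpha / (2.0 * beta))                    # sparse component
    return u, z - u, history

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[:5] = 3.0
y = A @ (x_true + 0.05 * rng.normal(size=100)) + 0.05 * rng.normal(size=40)
u, v, history = infconv_descent(A, y, alpha=0.1, beta=1.0)
joint = np.sum((A @ (u + v) - y) ** 2) + 0.1 * np.sum(np.abs(u)) + 1.0 * np.sum(v ** 2)
```

The iteration runs on the sum of both components and only needs matrix-vector products with `A`; the two components are separated once at the end. The value `joint` of the multi-penalty functional at the recovered pair equals the last entry of `history`.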
2.2.1 Linear Convergence
Though in (14) is continuous and separable, i.e. , it is not continuously differentiable, so we cannot apply [28] to deduce linear convergence of (15). Nevertheless, using the KKT conditions of the objective functions in (16), we get linear convergence of the iterates in (15) by a similar strategy as in Theorem 2.7.
Theorem 2.11.
Let and . Assume (see Footnote 4) that and . Let denote the support of and define . Then there exists such that for all we have
[TABLE]
Footnote 4: Along the lines of Footnote 3 in Theorem 2.7. Just note that in (14) has the KŁ property by [27, Theorem 3.1] and, hence, the objective function in (13) has it as well.
The proof of Theorem 2.11 is given in Section A.5.
Remark 2.12.
On the one hand, in Theorem 2.11 the assumption on and the rate differ from Theorem 2.7; there is no influence of on admissible step-sizes and the rate is split in two distinct components. On the other hand, since, for ,
[TABLE]
the rate in Theorem 2.11 suggests to choose large to dominate the second term of the rate in which case the assumptions on agree in both theorems. Moreover, this reduces the rate to
[TABLE]
where the denominator is as in Theorem 2.7. In light of (17), we get linear convergence of (15) if
[TABLE]
As already discussed in Remark 2.8, a trade-off between regularization and convergence rate has to be taken into account when choosing and .
Remark 2.13.
For , an alternative viewpoint on (16) is given by
[TABLE]
where we used [22, Eq. (3.3)] in the last step, meaning that
[TABLE]
is a proximal gradient descent sequence of , the squared -norm of the gradient of the smooth Moreau approximation of . From this perspective, multi-penalty regularization resembles a Newton-type method by searching for zeros of the derivative of a smooth approximation of the -norm. However, transferring this intuition to the case is non-trivial. On a technical level the equations in (18) break down in the third line, which does not hold for due to non-convexity of .
2.2.2 Computational Complexity
While (9) requires computing , which can be costly, the infimal convolution formulation (15) does not incur additional computational costs and thus directly inherits efficiency and linear convergence of the proximal descent method. Indeed, for a fixed number of iterations the number of operations performed in (15) is (the additional convex combination when evaluating the proximal operator by Lemma 2.10 is negligible). This is considerably lower than , for , which is the computational cost of the augmented formulation, particularly if is large. In numerical simulations, this effect is easy to observe, cf. Section 3.
3 Numerical Experiments
We now present experimental results that focus on two aspects of our study. First, we examine the convergence rate of the proposed algorithms, confirming linear convergence and in case of the augmented formulation, the dependence of the convergence constant on the parameters of the problem. Second, we examine their efficiency by studying the overall computational effort on larger scale problems.
3.1 Convergence Rate
Via the RIP-constant Theorem 2.7 gives a direct dependence of the convergence rate on the sparsity of the solution and the properties of the matrix, whereas Theorem 2.11 is harder to interpret: it is straight-forward to deduce the existence of parameter regimes in which linear convergence occurs but hard to quantify the rate in terms of the parameters. While numerical evidence for linear convergence of the infimal convolution formulation is observed in Section 3.2, we continue by validating Theorem 2.7 in two experiments. In both, we take , and add pre- and post-measurement Gaussian noise terms, and , with noise level . We choose an admissible according to Remark 2.8 and tune it such that the reconstructed signal shares its support size with the ground-truth. Both illustrations in Figure 1 plot the relative error between the iterates and the stationary point against the number of proximal gradient descent steps.
Varying the Penalty Parameter.
In the first experiment we take a Gaussian matrix , a -sparse signal , and vary . Theorem 2.7 predicts that smaller values of allow larger step sizes, though the convergence constants are (essentially) the same. This effect is readily observed in Figure 1(a). Note that we can also observe that for smaller the algorithm reaches the steep part of the curve faster. This is due to the fact that the convergence of iterates is initially slow (until the support is identified) and larger step sizes allow the support size to be reduced faster. The overall speed-up allowed by a smaller can be up to two-fold, in terms of the number of iterations needed to reach the desired accuracy level.
Varying the Measurements.
In the second experiment we consider a Gaussian matrix , for , and a -sparse signal . Varying the number of measurements changes the RIP of the measurement matrix (a larger decreases , see Remark 2.3), and per Theorem 2.7 should affect the convergence constant. Figure 1(b) shows exactly that. An analogous effect can be observed for different classes of measurement matrices, such as partial Toeplitz, or partial circulant matrices with Rademacher or Gaussian entries, but those results have not been included for the sake of brevity.
3.2 Computational Comparison
Iteration Count.
In order to provide numerical evidence for our initial statement that alternating minimization is highly sub-optimal, in Figure 2(a) we look at the decay of the relative error over the number of basic iterations, i.e. the number of thresholded gradient descent steps, of all three discussed approaches: alternating minimization (3), augmented formulation (9), and infimal convolution (15). In this experiment, we use a Gaussian matrix , the original signal is -sparse, and the parameters , , and are selected so that each method returns a -sparse vector. The $x$-axis refers to the number of times the proximal operator is called while the $y$-axis shows the relative error. The considerably worse performance of alternating minimization is due to the fact that it requires (too) many thresholded gradient steps to solve, for each , sub-problems for the component up to pre-fixed accuracy . Thus, the algorithm performs hardly any alternating steps.
Computation Time.
To illustrate the differences between the augmented and infimal convolution formulations in terms of computational complexity, we perform the following experiment. We set the parameters generically to , , and , and reconstruct a -sparse signal from measurements , for varying from (sub-sampling) to (over-sampling). We again take , and add pre- and post-measurement noise terms, and , with noise level . Averaging over random realizations of , we record for augmented (9) and infimal convolution approach (15) the time needed to perform iterations. After so few iterations neither of the two algorithms has converged, though this already suffices to make a point regarding the computational cost since both algorithms incur the same cost (i.e. the gap remains the same) in the remaining iterations. As Figure 2(b) shows, the additional computation of in (9) causes a massive additional workload leading to limited applicability of the augmented approach in large-scale settings. In contrast, the infimal convolution formulation is hardly affected by the increase in the number of measurements. Though the augmented approach tends to converge in fewer iterations, cf. Figure 2(a), the additional iterations needed by the infimal convolution formulation to reach a comparable level of accuracy do not close the gap in computation time. Note that we do not include alternating minimization here since it requires many more iterations (in the sense of single thresholded gradient descent steps) to show similar reconstruction performance as both proximal descents, and hence could not compete with those two algorithms.
4 Discussion
In the present work we discussed the benefits of multi-penalty regularization for support recovery of signals when pre-measurement noise is amplified by the measurement operator, as well as the numerical challenges in solving the corresponding variational formulation. Since alternating minimization is sub-optimal for this task in terms of both computational efficiency and theoretical analysis, we proposed a novel reduction to single-penalty regularization based on infimal convolution, and compared this new approach to an existing reduction based on augmented formulations. Moreover, we established linear convergence for both single-penalty reductions and showed that our new approach avoids a computational bottleneck that is unavoidable in the augmented approach and that causes a significant additional computational workload if the number of measurements increases. There are several interesting open questions left for future work.
First, in Remark 2.13 we observed, for , a connection between the infimal convolution formulation and the proximal descent on the -norm of the gradient of a Moreau-regularized -functional. As we have not seen a comparable relation in the context of multi-penalty regularization so far, we are curious whether this observation can be extended to the case . If so, this might provide valuable insights into non-convex optimization.
Second, as the reader might have noticed, large parts of the arguments we used (support stabilization, sign stabilization, etc.) are not restricted to finite dimensions. In light of more general settings of multi-penalty regularization in [21] and single-penalty regularization in [9], it would be fruitful to transfer our findings to general separable Hilbert spaces as well.
Third, we mention that when using the infimal convolution based approach, in some experiments it was possible to choose much larger than suggested by Theorem 2.11, while still observing reliable convergence of the program. We wonder whether there is an alternative proof leading to a relaxed condition on resembling the assumption in Theorem 2.7.
Let us conclude by emphasizing that the infimal convolution formulation can as well be applied if regularizers other than the -norm are used in the multi-penalty problem, e.g. Smoothly Clipped Absolute Deviation (SCAD) [13], Minimax Concave Penalty (MCP) [29], and Log-Sum Penalty (LSP) [10]. In those cases the more general single-penalty rate analysis in [28] should prove useful as a tool.
Acknowledgment
ZK and VN acknowledge the support from RCN-funded FunDaHD project No 251149/O70. JM acknowledges the support of DFG-SPP 1798.
Appendix A Proofs
A.1 Proof of Lemma 2.4
For a fixed the minimization of in (4) with respect to reduces to Tikhonov minimization, and thus the solution satisfies
[TABLE]
Rewriting the above expression we have
[TABLE]
Plugging this expression into (4) the minimization problem for is rewritten as
[TABLE]
The Woodbury identity for invertible matrices , and matrices , reads
[TABLE]
Using (19), this gives
[TABLE]
Plugging this expression back into , and extracting the square root, we have . Minimizing over and using the following simple observation gives the conclusion.
Lemma A.1.
If is a local minimizer of (7), then the pair with defined in (6), is a local minimizer of in (4).
Proof.
Let be a local minimizer of and assume there exists a sequence such that , for all . We then have
[TABLE]
where the first inequality follows from the minimality of . This contradicts the assumption that is a local minimizer of . ∎
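For reference, the reduction used in the proof of Lemma 2.4 can be written out as a short worked computation. This is a sketch under the assumption that the functional has the form $\|\mathbf{A}(\mathbf{u}+\mathbf{v})-\mathbf{y}\|_2^2 + \alpha\|\mathbf{u}\|_p^p + \beta\|\mathbf{v}\|_2^2$; the residual $\mathbf{r}$ and the matrix $\mathbf{M}$ are names introduced here for illustration, and a different scaling of the fidelity term would rescale $\beta$ accordingly.

```latex
% Fix u and set r = y - A u, so that ||A(u+v) - y||^2 = ||A v - r||^2.
% The names r and M are illustrative and not taken from the paper.
\begin{align*}
\mathbf{v}(\mathbf{u}) &= (\beta\mathbf{I} + \mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{r}, \\
\min_{\mathbf{v}} \; \|\mathbf{A}\mathbf{v} - \mathbf{r}\|_2^2 + \beta\|\mathbf{v}\|_2^2
  &= \mathbf{r}^\top\big(\mathbf{I} - \mathbf{A}(\beta\mathbf{I} + \mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\big)\mathbf{r} \\
  &= \mathbf{r}^\top\mathbf{M}^{-1}\mathbf{r}
   = \big\|\mathbf{M}^{-1/2}(\mathbf{y} - \mathbf{A}\mathbf{u})\big\|_2^2,
  \qquad \mathbf{M} = \mathbf{I} + \tfrac{1}{\beta}\mathbf{A}\mathbf{A}^\top .
\end{align*}
```

The first line is the normal equation of the Tikhonov problem in $\mathbf{v}$, the second follows by inserting $\mathbf{v}(\mathbf{u})$, and the third uses the Woodbury identity in the form $(\mathbf{I} + \tfrac{1}{\beta}\mathbf{A}\mathbf{A}^\top)^{-1} = \mathbf{I} - \mathbf{A}(\beta\mathbf{I} + \mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top$.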
A.2 Proof of Lemma 2.6
First note that
[TABLE]
while
[TABLE]
Hence, it suffices to show that
[TABLE]
Extracting from the left and using the Woodbury identity (20) with , , and the conclusion follows.
A.3 Proof of Theorem 2.7
In order to prove Theorem 2.7, we have to control the eigenvalues of characterizing the growth of the data fidelity term in (7).
Lemma A.2.
For defined as in Lemma 2.4,
[TABLE]
is the Lipschitz-constant of the gradient of the augmented data-fidelity term . Moreover, for any ,
[TABLE]
Proof.
Let denote the SVD of . This gives
[TABLE]
so that By (21), we have for any
[TABLE]
implying the second claim. ∎
We can now show that all but finitely many iterates generated by (9) share the same support and sign pattern. The proof is standard and follows [9].
Lemma A.3 (Support and sign recovery).
Assume , , and . Then the iterates satisfy as . Moreover, all but finitely many iterates have the same support and sign pattern.
Proof.
Since we have as by [9, Corollary 2.1]. Now, since the range of is , it follows that the absolute value of a non-zero entry of , for , is at least . Thus, if we have , and analogously, if we have . Thus, since as , sign and support can change only finitely many times. ∎
Proof of Theorem 2.7.
By Lemma A.3 there exists such that for all the support of is finite, and the support and sign of are equal to those of . Thus, by [9, Proposition 2.3], is a fixed point of (9). Denote with . The definition of the proximal operator in (10) and the Karush-Kuhn-Tucker (KKT) conditions yield
[TABLE]
and
[TABLE]
Subtracting the two equations on the index set , and denoting , we have
[TABLE]
where is acting entry-wise. Note that since we have and . A straightforward calculation gives
[TABLE]
where . Taking the inner product of (23) with , and applying the Cauchy-Schwarz inequality, we get
[TABLE]
Since is twice differentiable, and and have the same sign and support, we have for the second term
[TABLE]
where lies between and , and . Since , we may assume sufficiently large to guarantee , for all and . Consequently,
[TABLE]
Thus,
[TABLE]
On the other hand, since , we have
[TABLE]
by Lemma A.2. Thus,
[TABLE]
Together with the RIP of this yields the claim. ∎
A.4 Proof of Lemma 2.10
Let be fixed and assume without loss of generality. We have
[TABLE]
By being lower semi-continuous and bounded from below, we have
[TABLE]
implying . Denote by the line connecting and . Since is convex, we have , for any , with equality if and only if . Consequently, if solves the above program, we have for some . Let us define
[TABLE]
By the above considerations we have
[TABLE]
where there is a one-to-one correspondence between solutions of the left side and solutions . Moreover, it follows easily that for fixed,
[TABLE]
which is independent of . Thus, the claim follows since
[TABLE]
A.5 Proof of Theorem 2.11
As in the proof of Theorem 2.7, the first step is to control support and signs of the iterates. Recall that, for as in (15), we denote by the sequence of minimizers attaining , by , and that by (16) we have
[TABLE]
Lemma A.4 (Sign and support stability).
Assume . Then the successive iterates , , and converge to zero and all but finitely many iterates share the same finite support and the same signs.
Proof.
First, note that is a proper and coercive function. Second, as , for continuous, we obtain continuity of at any point since by coercivity of the infimum can be restricted to a finite ball and the infimum of continuous functions on a compact set is continuous. Consequently, by [9, Corollary 2.1] and the assumption on we have , for . By the KKT-conditions of (24), we obtain
[TABLE]
Subtracting the two equations gives , and yields . The second claim follows as in Lemma A.3, since is a thresholded version of . ∎
Proof of Theorem 2.11.
First note that implies via Lemma A.4 that and . Furthermore, is a fixed point of (15), by [9, Proposition 2.3]. By Lemma A.4 there exists such that for all the support of is finite, and the support and sign of are equal to those of . Denote . By the KKT-conditions of (24), we get
[TABLE]
and
[TABLE]
For with acting entry-wise, this implies
[TABLE]
and
[TABLE]
Repeating the steps as in Theorem 2.7, from (25) we get
[TABLE]
and from (26) we obtain
[TABLE]
Squaring and summing the last two equations, the claim follows by orthogonality of and . ∎
Appendix B Coherence Bound
The following lemma bounds the coherence of in terms of the coherence of . The bound becomes tight for large choices of .
Lemma B.1.
We have
[TABLE]
Proof.
Recall that the coherence of a matrix is defined as
[TABLE]
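For concreteness, the coherence can be computed as follows. The sketch assumes the standard normalized definition, that is, the largest absolute inner product between two distinct columns after scaling each column to unit norm, and it assumes that no column is zero. The function name `coherence` is illustrative.

```python
import numpy as np

def coherence(A):
    """Largest absolute normalized inner product between distinct columns of A."""
    norms = np.linalg.norm(A, axis=0)   # column norms, assumed nonzero
    A_unit = A / norms                  # scale every column to unit length
    gram = np.abs(A_unit.T @ A_unit)    # |<a_i, a_j>| for all column pairs
    np.fill_diagonal(gram, 0.0)         # exclude the trivial pairs i = j
    return gram.max()

# Orthonormal columns have coherence 0.
print(coherence(np.eye(3)))

# Columns (1, 0) and (1, 1) enclose an angle of 45 degrees,
# so the coherence is 1 / sqrt(2), approximately 0.7071.
print(coherence(np.array([[1.0, 1.0], [0.0, 1.0]])))
```

The value always lies in $[0, 1]$, and it equals 1 exactly when two columns are collinear.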
where is the -th column of . Define , so that , and let be the SVD of . This gives
[TABLE]
Therefore,
[TABLE]
for , and by the triangle inequality and the Cauchy-Schwarz inequality
[TABLE]
for all columns of . By the same argument we compute
[TABLE]
giving
[TABLE]
This yields
[TABLE]
which implies
[TABLE]
∎
For small , the bound in Lemma B.1 is lossy. However, we can show that the coherence of converges to the coherence of a conditioned version of , for .
Lemma B.2.
Let , for , have full rank. We have , for .
Proof.
Define , so that , and let be the SVD of . Define with columns . First note that
[TABLE]
and
[TABLE]
Consequently,
[TABLE]
and
[TABLE]
Since we have in addition that , , and , we get
[TABLE]
We conclude by noting that . ∎
References
- [1] S. Aeron, V. Saligrama, and M. Zhao. Information theoretic bounds for compressed sensing. IEEE Transactions on Information Theory, 56(10):5111–5130, 2010.
- [2] E. Arias-Castro and Y. C. Eldar. Noise folding in compressed sensing. IEEE Signal Processing Letters, 18(8):478–481, 2011.
- [3] M. Artina, M. Fornasier, and S. Peter. Damping noise-folding and enhanced support recovery in compressed sensing. IEEE Transactions on Signal Processing, 63(22):5990–6002, 2015.
- [4] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
- [5] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
- [6] H. H. Bauschke, P. L. Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.
- [7] A. Beck. First-order methods in optimization. SIAM, 2017.
- [8] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471–507, 2017.
