Computational approaches to non-convex, sparsity-inducing multi-penalty regularization

Zeljko Kereta; Johannes Maly; and Valeriya Naumova

arXiv:1908.02503·cs.IT·January 15, 2021


TL;DR

This paper investigates efficient algorithms for non-convex multi-penalty regularization in sparse signal reconstruction, introducing a new infimal convolution approach with proven linear convergence and validated by numerical experiments.

Contribution

It extends existing methods to non-convex settings, proposes a computationally efficient infimal convolution approach, and provides convergence analysis with numerical validation.

Findings

1. Both approaches achieve linear convergence rates.

2. The infimal convolution method is less dependent on problem size.

3. Numerical experiments confirm theoretical convergence rates.

Abstract

In this work we consider numerical efficiency and convergence rates for solvers of non-convex multi-penalty formulations when reconstructing sparse signals from noisy linear measurements. We extend an existing approach, based on reduction to an augmented single-penalty formulation, to the non-convex setting and discuss its computational intractability in large-scale applications. To circumvent this limitation, we propose an alternative single-penalty reduction based on infimal convolution that shares the benefits of the augmented approach but is computationally less dependent on the problem size. We provide linear convergence rates for both approaches, and their dependence on design parameters. Numerical experiments substantiate our theoretical findings.

Equations

$$\mathbf{A}(\mathbf{u}^{\dagger}+\mathbf{v})+\boldsymbol{\xi}=\mathbf{y}, \tag{1}$$

$$\min_{\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}(\mathbf{u}+\mathbf{v})-\mathbf{y}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}+\frac{\beta}{p}\|\mathbf{v}\|_{p}^{p}, \tag{2}$$

$$\begin{aligned}\mathbf{u}^{k+1}&\in\operatorname*{argmin}_{\mathbf{u}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}(\mathbf{u}+\mathbf{v}^{k})-\mathbf{y}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q},\\ \mathbf{v}^{k+1}&\in\operatorname*{argmin}_{\mathbf{v}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}(\mathbf{u}^{k+1}+\mathbf{v})-\mathbf{y}\|_{2}^{2}+\frac{\beta}{p}\|\mathbf{v}\|_{p}^{p}.\end{aligned} \tag{3}$$

$$\mathrm{supp}(\mathbf{u})=\{i\in[n]\colon\mathsf{u}_{i}\neq 0\}$$

$$\operatorname{sgn}(\mathsf{u})=\begin{cases}1,&\text{if }\mathsf{u}>0,\\ 0,&\text{if }\mathsf{u}=0,\\ -1,&\text{if }\mathsf{u}<0.\end{cases}$$

$$\mathcal{T}_{\alpha,\beta}^{q}(\mathbf{u},\mathbf{v}):=\frac{1}{2}\|\mathbf{A}(\mathbf{u}+\mathbf{v})-\mathbf{y}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}+\frac{\beta}{2}\|\mathbf{v}\|_{2}^{2}, \tag{4}$$

$$(\mathbf{u}_{\alpha,\beta}^{q},\mathbf{v}_{\alpha,\beta}^{q})\in\operatorname*{argmin}_{\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}}\ \mathcal{T}_{\alpha,\beta}^{q}(\mathbf{u},\mathbf{v}).$$

$$\varphi^{\prime}\big(f(\mathbf{x})-f(\bar{\mathbf{x}})\big)\,\operatorname{dist}\big(\mathbf{0},\partial f(\mathbf{x})\big)\geq 1.$$

$$(1-\delta_{s})\|\mathbf{u}\|_{2}\leq\|\mathbf{A}\mathbf{u}\|_{2}\leq(1+\delta_{s})\|\mathbf{u}\|_{2}.$$

$$m\geq C\delta_{s}^{-2}\,s\log\Big(\frac{en}{s}\Big)$$

$$\mathbf{v}_{\alpha,\beta}^{q}=\mathbf{v}(\mathbf{u}_{\alpha,\beta}^{q})=(\beta\,\mathbf{Id}_{n}+\mathbf{A}^{\top}\mathbf{A})^{-1}(\mathbf{A}^{\top}\mathbf{y}-\mathbf{A}^{\top}\mathbf{A}\mathbf{u}_{\alpha,\beta}^{q}), \tag{6}$$

$$\mathbf{u}_{\alpha,\beta}^{q}\in\operatorname*{argmin}_{\mathbf{u}\in\mathbb{R}^{n}}\ \mathcal{F}_{\beta}(\mathbf{u}),\qquad\mathcal{F}_{\beta}(\mathbf{u}):=\frac{1}{2}\|\mathbf{B}_{\beta}\mathbf{u}-\mathbf{y}_{\beta}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}, \tag{7}$$

$$\mathbf{B}_{\beta}=\Big(\mathbf{Id}_{m}+\frac{\mathbf{A}\mathbf{A}^{\top}}{\beta}\Big)^{-1/2}\mathbf{A}\quad\text{and}\quad\mathbf{y}_{\beta}=\Big(\mathbf{Id}_{m}+\frac{\mathbf{A}\mathbf{A}^{\top}}{\beta}\Big)^{-1/2}\mathbf{y}.$$

$$\mathrm{coh}(\mathbf{M})=\max_{i\neq j}\frac{\mathbf{m}_{i}^{\top}\mathbf{m}_{j}}{\|\mathbf{m}_{i}\|_{2}\|\mathbf{m}_{j}\|_{2}},$$

$$\mathrm{coh}(\mathbf{B}_{\beta})\leq\Big(1+\frac{\|\mathbf{A}\|^{2}}{\beta}\Big)\Big(\mathrm{coh}(\mathbf{A})+\frac{\|\mathbf{A}\|^{2}}{\beta}\Big). \tag{8}$$

$$\begin{cases}\text{Set the initial vector }\mathbf{u}^{0},\\ \mathbf{u}^{k+1}=\operatorname{prox}_{\mu,\frac{\alpha}{q}\|\cdot\|_{q}^{q}}\big(\mathbf{u}^{k}-\mu\mathbf{B}_{\beta}^{\top}(\mathbf{B}_{\beta}\mathbf{u}^{k}-\mathbf{y}_{\beta})\big).\end{cases} \tag{9}$$

$$\operatorname{prox}_{\mu,\nu\Psi}(\mathbf{u})=\operatorname*{argmin}_{\mathbf{z}\in\mathbb{R}^{n}}\ \frac{1}{2\mu}\|\mathbf{z}-\mathbf{u}\|_{2}^{2}+\nu\Psi(\mathbf{z}), \tag{10}$$

$$\operatorname{prox}_{\mu,\nu|\cdot|^{q}}(\mathsf{u})$$

where $\tau_{\mu}$

$$\mathbf{u}^{k+1}=\operatorname{prox}_{\mu,\frac{\alpha}{q}\|\cdot\|_{q}^{q}}\big(\mathbf{u}^{k}-\mu\mathbf{A}^{\top}(\mathbf{A}\mathbf{u}^{k}+\mathbf{A}\mathbf{v}(\mathbf{u}^{k})-\mathbf{y})\big),$$

$$\|\mathbf{u}^{k+1}-\mathbf{u}^{\star}\|_{2}\leq\frac{1-\mu\big(1+\frac{\|\mathbf{A}\|^{2}}{\beta}\big)^{-1}(1-\delta_{s})^{2}}{1-\mu\alpha(1-q)\big(\frac{d_{\min}}{2}\big)^{q-2}}\,\|\mathbf{u}^{k}-\mathbf{u}^{\star}\|_{2}.$$

$$0<\alpha<\alpha^{\star}=\Big(1+\frac{\|\mathbf{A}\|^{2}}{\beta}\Big)^{-1}\frac{(1-\delta_{s})^{2}}{(1-q)}\Big(\frac{d_{\min}}{2}\Big)^{2-q}.$$

$$\|\mathbf{u}^{k+1}-\mathbf{u}^{\star}\|_{2}\leq\frac{1-\|\mathbf{A}\|^{-2}(1-\delta_{s})^{2}}{1-c\,\|\mathbf{A}\|^{-2}(1-\delta_{s})^{2}}\,\|\mathbf{u}^{k}-\mathbf{u}^{\star}\|_{2}.$$

$$\mathbf{w}_{\alpha,\beta}^{q}=\operatorname*{argmin}_{\mathbf{w}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}\mathbf{w}-\mathbf{y}\|_{2}^{2}+\Big(\frac{\alpha}{q}\|\cdot\|_{q}^{q}\,\Delta\,\frac{\beta}{2}\|\cdot\|_{2}^{2}\Big)(\mathbf{w}), \tag{13}$$

$$g(\mathbf{w}):=\Big(\frac{\alpha}{q}\|\cdot\|_{q}^{q}\,\Delta\,\frac{\beta}{2}\|\cdot\|_{2}^{2}\Big)(\mathbf{w})=\inf_{\mathbf{u}\in\mathbb{R}^{n}}\ \frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}+\frac{\beta}{2}\|\mathbf{w}-\mathbf{u}\|_{2}^{2}. \tag{14}$$

$$M_{t,f}(\mathbf{x})=\Big(f\,\Delta\,\frac{1}{2t}\|\cdot\|_{2}^{2}\Big)(\mathbf{x})=f(\operatorname{prox}_{t,f}(\mathbf{x}))+\frac{1}{2t}\|\mathbf{x}-\operatorname{prox}_{t,f}(\mathbf{x})\|_{2}^{2},$$

$$\operatorname{prox}_{\mu,\lambda M_{t,f}}(\mathbf{x})=\frac{t}{t+\mu\lambda}\,\mathbf{x}+\frac{\mu\lambda}{t+\mu\lambda}\operatorname{prox}_{(t+\mu\lambda),f}(\mathbf{x}).$$

$$\begin{cases}\text{Set the initial vector }\mathbf{w}^{0},\\ \mathbf{w}^{k+1}=\operatorname{prox}_{\mu,g}\big(\mathbf{w}^{k}-\mu\mathbf{A}^{\top}(\mathbf{A}\mathbf{w}^{k}-\mathbf{y})\big).\end{cases} \tag{15}$$

$$\begin{cases}\mathbf{w}^{k}=\operatorname*{argmin}_{\mathbf{w}\in\mathbb{R}^{n}}\ \frac{1}{2\mu}\|\mathbf{w}-\mathbf{w}^{k-1}+\mu\mathbf{A}^{\top}(\mathbf{A}\mathbf{w}^{k-1}-\mathbf{y})\|_{2}^{2}+\frac{\beta}{2}\|\mathbf{w}-\mathbf{u}^{k}\|_{2}^{2},\\ \mathbf{u}^{k}=\operatorname*{argmin}_{\mathbf{u}\in\mathbb{R}^{n}}\ \frac{\beta}{2}\|\mathbf{u}-\mathbf{w}^{k}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}.\end{cases} \tag{16}$$

$$\|\mathbf{w}^{k+1}-\mathbf{w}^{\star}\|_{2}\leq\left(\frac{\|\mathbf{P}_{I}-\mu\mathbf{A}_{I}^{\top}\mathbf{A}\|^{2}}{\big(1-\alpha\mu(1-q)\big(\frac{d_{\min}}{2}\big)^{q-2}\big)^{2}}+\frac{\|\mathbf{P}_{I^{c}}-\mu\mathbf{A}_{I^{c}}^{\top}\mathbf{A}\|^{2}}{(1+\mu\beta)^{2}}\right)^{1/2}\|\mathbf{w}^{k}-\mathbf{w}^{\star}\|_{2}$$

$$\begin{aligned}\|\mathbf{P}_{I}-\mu\mathbf{A}_{I}^{\top}\mathbf{A}\|&=\|\mathbf{P}_{I}(\mathbf{Id}_{n}-\mu\mathbf{A}^{\top}\mathbf{A})\|\leq\|\mathbf{Id}_{n}-\mu\mathbf{A}^{\top}\mathbf{A}\|<1\quad\text{and}\\ \|\mathbf{P}_{I^{c}}-\mu\mathbf{A}_{I^{c}}^{\top}\mathbf{A}\|&=\|\mathbf{P}_{I^{c}}(\mathbf{Id}_{n}-\mu\mathbf{A}^{\top}\mathbf{A})\|\leq\|\mathbf{Id}_{n}-\mu\mathbf{A}^{\top}\mathbf{A}\|<1,\end{aligned}$$


Full text


Željko Kereta, University College London, United Kingdom

Simula Research Laboratory, Simula Metropolitan Center for Digital Engineering, Norway

Johannes Maly, KU Eichstaett/Ingolstadt, Germany

Valeriya Naumova, Simula Research Laboratory, Simula Metropolitan Center for Digital Engineering, Norway


1 Introduction

In many real-life applications one is interested in recovering a structured signal from few corrupted linear measurements. One particular challenge lies in separating the ground-truth from pre-measurement noise since any such corruption is amplified during the measurement process, a phenomenon known as noise folding [2] or input noise model [1]. It commonly appears in signal processing and compressed sensing applications, where noise is added to the signal both before and after the measurement process occurs. This can be modeled as

$$\mathbf{A}(\mathbf{u}^{\dagger}+\mathbf{v})+\boldsymbol{\xi}=\mathbf{y}, \tag{1}$$

where $\mathbf{u}^{\dagger}\in\mathbb{R}^{n}$ is an $s$-sparse original signal that we want to recover, $\mathbf{v}\in\mathbb{R}^{n}$ is the pre-measurement noise, $\boldsymbol{\xi}\in\mathbb{R}^{m}$ is the post-measurement noise, and $\mathbf{A}\in\mathbb{R}^{m\times n}$ is the measurement matrix. Note that a signal $\mathbf{u}\in\mathbb{R}^{n}$ is called $s$-sparse if its support consists of at most $s$ elements, i.e. $|\mathrm{supp}(\mathbf{u})|=|\left\{i\colon\mathsf{u}_{i}\neq 0\right\}|\leq s$. Information theoretic bounds state that the number of measurements $m$ required for the exact support recovery of $\mathbf{u}^{\dagger}$ from (1) needs to scale linearly¹ with $n$, which leads to poor compression performance [1].

¹ Assume for simplicity $\mathbf{v}{\perp\!\!\!\perp}\boldsymbol{\xi}$, $\boldsymbol{\xi}\sim{\cal N}(0,\sigma^{2}\mathbf{Id}_{m})$, and $\mathbf{v}\sim{\cal N}(0,\sigma_{v}^{2}\mathbf{Id}_{n})$. We now write (1) as $\mathbf{y}=\mathbf{A}\mathbf{u}^{\dagger}+\mathbf{w}$, where $\mathbf{w}:=\mathbf{A}\mathbf{v}+\boldsymbol{\xi}$ represents the effective noise. The covariance matrix of $\mathbf{w}$ equals $\sigma^{2}\mathbf{Id}_{m}+\sigma_{v}^{2}\mathbf{A}\mathbf{A}^{\top}=:\mathbf{Q}$. Assuming $\mathbf{A}\mathbf{A}^{\top}\approx\frac{n}{m}\mathbf{Id}_{m}$ (as is the case, with high probability, for $\mathbf{A}$ with zero mean, $1/m$-variance sub-Gaussian entries), and $\sigma_{v}\approx\sigma$, we would have $\mathbf{Q}=\sigma^{2}(1+C\frac{n}{m})\mathbf{Id}_{m}$, for $C>0$. Thus, the variance of the noise rises by a factor proportional to $n/m$, which can be substantial when $m\ll n$.
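The effective-noise argument above can be illustrated numerically. The following is a minimal sketch, assuming a Gaussian matrix with variance-$1/m$ entries and equal noise levels; the sizes `m`, `n` and the values of `sigma`, `sigma_v` are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 400              # illustrative sizes with m << n (assumption)
sigma, sigma_v = 0.1, 0.1   # post- and pre-measurement noise levels (assumption)

# Gaussian measurement matrix with zero-mean, variance-1/m entries.
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

# Covariance of the effective noise w = A v + xi.
Q = sigma**2 * np.eye(m) + sigma_v**2 * (A @ A.T)

# Average per-coordinate variance of w versus the prediction sigma^2 (1 + n/m).
avg_var = np.trace(Q) / m
predicted = sigma**2 * (1.0 + n / m)
inflation = avg_var / sigma**2

print(f"average variance: {avg_var:.4f}, predicted: {predicted:.4f}")
print(f"inflation factor: {inflation:.2f} (n/m = {n / m:.1f})")
```

With $m\ll n$ the inflation factor is dominated by $n/m$, which is the noise folding effect described in the text.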

A number of recent studies [3, 21, 16, 15] try to mitigate these issues through a multi-penalty regularization framework defined as

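$$\min_{\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}(\mathbf{u}+\mathbf{v})-\mathbf{y}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}+\frac{\beta}{p}\|\mathbf{v}\|_{p}^{p}, \tag{2}$$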

where $\alpha,\beta>0$ are regularization parameters, $0\leq q<2$ , and $2\leq p<\infty$ . In particular, to promote sparsity of the ${\mathbf{u}}$ component we choose $q\leq 1$ . A natural way to minimize (2) is via alternating minimization, starting from ${\mathbf{u}}^{0},{\mathbf{v}}^{0}\in\mathbb{R}^{n}$ and then iterating as

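$$\begin{aligned}\mathbf{u}^{k+1}&\in\operatorname*{argmin}_{\mathbf{u}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}(\mathbf{u}+\mathbf{v}^{k})-\mathbf{y}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q},\\ \mathbf{v}^{k+1}&\in\operatorname*{argmin}_{\mathbf{v}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}(\mathbf{u}^{k+1}+\mathbf{v})-\mathbf{y}\|_{2}^{2}+\frac{\beta}{p}\|\mathbf{v}\|_{p}^{p}.\end{aligned} \tag{3}$$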

Whereas the second problem is differentiable and admits an explicit solution, the first problem requires iterative thresholding for $q\leq 1$ [21], for each outer iteration $k\in\mathbb{N}$ , and becomes non-convex if $q<1$ . Moreover, alternating minimization does not lend itself to an easy analysis of the convergence rate.

1.1 Contribution

In this work we examine the multi-penalty problem (2) for the case $0<q\leq 1$ and $p=2$. We first show that the augmented approach in [16], which decouples the computation of the $\mathbf{u}$ and $\mathbf{v}$ components of the solution, can be easily extended to $q<1$ to obtain an augmented single-penalty iterative thresholding algorithm providing solutions to (2). Since this involves computing the inverse of a possibly high-dimensional matrix, we suggest an alternative single-penalty iterative thresholding algorithm, which is based on an infimal convolution formulation of (2) and sidesteps the computational bottleneck of the augmented approach. We show a linear convergence rate for both approaches, as a function of the design parameters, and confirm both the rate analysis and the efficiency gap in numerical simulations. In particular, we argue that the benefits of faster convergence rates are sometimes offset by the computational demands, which suggests that the preferred method for solving the optimization problem should be chosen according to the size of $\mathbf{A}$.

1.2 Related Work

In [21] the authors approach (3), for $0<q\leq 1$ and $p=\infty$, on separable Hilbert spaces by applying iterative thresholding algorithms to each of the sub-problems, and show convergence of the sequence of iterates to stationary points of the underlying problem. The choice $p=\infty$ is of special interest when $\mathbf{v}$ models uniform pre-measurement noise. However, the authors also show that $p=2$ exhibits the best (empirical) performance for the reconstruction of $\mathbf{u}^{\dagger}$, for $\mathbf{v}$ modelling various common noise types (including uniform noise). It is for this reason that in this paper we are concerned only with the case $p=2$. We add, though, that more general noise types might be of interest in very particular cases, and this is a possible topic for future research. In [16] the authors reduce the optimization problem (2) to a single-penalty regularization through an augmented data matrix, for $q=1$ and $p=2$, and derive conditions on optimal support recovery. The authors provide theoretical and numerical evidence of superior performance of multi-penalty regularization over standard single-penalty approaches for the sparse recovery of solutions to (1). In [15] a principled, data-driven parameter selection approach is derived for $q=1$ and $p=2$, based on the Lasso path. Besides noise folding, a multi-penalty formulation of the objective function also arises when recovering a signal that is a superposition of two components, e.g. a sparse and a smooth component; see [12] and references therein. In spite of these and other advances, rigorous results regarding convergence rate and error analysis for (2) have not been established.

Since we reduce (2) to specific single-penalty problems, corresponding convergence results on classical proximal descent methods are of interest. In [9] important insights on support stability and convergence of iterative thresholding algorithms on separable Hilbert spaces have been collected, while [28] proved linear convergence rates of the iterative thresholding algorithm, under certain conditions, if the underlying thresholding operator is not continuous, though the dependence on the parameters of the optimization scheme is not explicitly derived. Linear convergence for a single-penalty non-convex regularizer with adaptive thresholding was established in [24], where the influence of the RIP of the design matrix on the convergence constant can be inferred. A further survey of non-convex regularizers for sparse recovery can be found in [25].

Lastly, approaches representing regularizers as infimal convolution can be found in the context of machine learning and signal processing, cf. [17, 18]. Therein primal-dual schemes are examined for optimizing functionals penalized via infimal convolutions. The results, however, require piece-wise convexity which is not given in our case.

1.3 Notation

We restrict boldface lettering to matrices (uppercase), e.g. ${\mathbf{A}}$ , and vectors (lowercase), e.g. ${\mathbf{u}}$ . The $i^{\text{th}}$ entry of a vector ${\mathbf{u}}$ is denoted as $\mathsf{u}_{i}$ . For $m\in\mathbb{N}$ we denote $[m]:=\{1,\ldots,m\}$ . For $0<q\leq\infty$ the $\ell_{q}$ norm of a vector ${\mathbf{u}}=(\mathsf{u}_{1},\ldots,\mathsf{u}_{n})^{\top}\in\mathbb{R}^{n}$ is denoted by $\left\|{{\mathbf{u}}}\right\|_{q}$ . The support set of ${\mathbf{u}}\in\mathbb{R}^{n}$ is denoted as

$$\mathrm{supp}(\mathbf{u})=\{i\in[n]\colon\mathsf{u}_{i}\neq 0\}$$

and the sign $\operatorname{sgn}({\mathbf{u}})=(\operatorname{sgn}(\mathsf{u}_{i}))_{i=1}^{n}$ is defined component-wise by

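$$\operatorname{sgn}(\mathsf{u})=\begin{cases}1,&\text{if }\mathsf{u}>0,\\ 0,&\text{if }\mathsf{u}=0,\\ -1,&\text{if }\mathsf{u}<0.\end{cases}$$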

For a matrix ${\mathbf{M}}\in\mathbb{R}^{m\times n}$ , we use $\left\|{{\mathbf{M}}}\right\|$ to denote its spectral norm and $\lambda_{\min}({\mathbf{M}})$ to denote its smallest singular value. We denote the $n\times n$ identity matrix by $\mathsf{{\mathbf{I}}{{\mathbf{d}}}}_{n}$ . For $I\subset[n]$ , ${\mathbf{M}}_{I}\in\mathbb{R}^{m\times\left|{I}\right|}$ represents the submatrix of ${\mathbf{M}}$ containing the columns indexed by $I$ , and ${\mathbf{u}}_{I}\in\mathbb{R}^{\left|I\right|}$ denotes the subvector of ${\mathbf{u}}$ containing the entries restricted to $I$ . We denote the corresponding orthogonal projection operator onto $I$ as ${\mathbf{P}}_{I}\in\mathbb{R}^{\left|I\right|\times n}$ , so that ${\mathbf{P}}_{I}{\mathbf{u}}={\mathbf{u}}_{I}$ . When indexed by a set $T\subset\mathbb{R}^{n}$ , ${\mathbf{P}}_{T}$ denotes the orthogonal projection onto $T$ . Finally, the set-valued operator $\partial$ denotes the limiting Fréchet subdifferential, and $\operatorname{dom}\partial f=\left\{{\mathbf{x}}\colon\partial f({\mathbf{x}})\neq\emptyset\right\}$ is its corresponding domain when applied to a function $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\left\{\infty\right\}$ , cf. [23, 20].

2 Main Results

Consider the multi-penalty problem (2) for $p=2$ , i.e. minimizing

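$$\mathcal{T}_{\alpha,\beta}^{q}(\mathbf{u},\mathbf{v}):=\frac{1}{2}\|\mathbf{A}(\mathbf{u}+\mathbf{v})-\mathbf{y}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}+\frac{\beta}{2}\|\mathbf{v}\|_{2}^{2}, \tag{4}$$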

and denote a corresponding solution pair by

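$$(\mathbf{u}_{\alpha,\beta}^{q},\mathbf{v}_{\alpha,\beta}^{q})\in\operatorname*{argmin}_{\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}}\ \mathcal{T}_{\alpha,\beta}^{q}(\mathbf{u},\mathbf{v}).$$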

As mentioned above ${\mathbf{A}}\in\mathbb{R}^{m\times n}$ , ${\mathbf{y}}\in\mathbb{R}^{m}$ , $\alpha,\beta>0$ are regularization parameters balancing the contributions of the data-fidelity term and the two regularization terms, and $0<q\leq 1$ .

Let us introduce two widely known concepts relevant for the forthcoming discussion. The first is the Kurdyka-Łojasiewicz (KŁ) property, a well-established tool for analyzing the convergence, and convergence rates, of proximal descent algorithms [4].

Definition 2.1.

A function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{\infty\}$ is said to have the KŁ property at $\bar{\mathbf{x}}\in\operatorname{dom}\partial f$ if there exist $\eta\in(0,+\infty]$, a neighbourhood $\Omega$ of $\bar{\mathbf{x}}$, and a continuous concave function $\varphi:[0,\eta)\to\mathbb{R}_{+}$ such that

1. $\varphi\in C^{1}\left(0,\eta\right)$, $\varphi(0)=0$ and $\varphi^{\prime}(s)>0$ for all $s\in(0,\eta)$;

2. for all $\mathbf{x}\in\Omega\cap\{\mathbf{x}:f(\bar{\mathbf{x}})<f(\mathbf{x})<f(\bar{\mathbf{x}})+\eta\}$ the KŁ inequality holds:

$$\varphi^{\prime}\big(f(\mathbf{x})-f(\bar{\mathbf{x}})\big)\,\operatorname{dist}\big(\mathbf{0},\partial f(\mathbf{x})\big)\geq 1.$$

The KŁ property is used to describe the speed of convergence through the desingularizing function $\varphi$ . It has been shown that semi-algebraic functions satisfy the KŁ property with $\varphi(s)=cs^{1-\theta}$ , where $c>0$ and $\theta\in[0,1)$ is called the KŁ constant, which characterizes the convergence speed of proximal gradient descent algorithms [4, Theorem 11]. As observed in [8], Corollary 3.6 in [19] may be used to determine the KŁ constant of piecewise convex polynomials. Even though $\|{\cdot}\|_{q}^{q}$ has the KŁ property, cf. [5, Example 5.4], it does not result in piece-wise convex polynomials for $0<q<1$ , and thus we cannot apply [19, Corollary 3.6] to infer the speed of convergence. We will instead adopt and adapt the ideas from [9, 28].

The second concept relevant for this paper is the restricted isometry property (RIP), which allows one to control eigenvalues of small submatrices of $\mathbf{A}\in\mathbb{R}^{m\times n}$, and to characterize measurement operators that allow stable and robust reconstruction of sparse signals from $m\ll n$ measurements.

Definition 2.2.

A matrix ${\mathbf{A}}\in\mathbb{R}^{m\times n}$ satisfies the restricted isometry property of order $s$ ( $s$ -RIP) with constant $\delta_{s}\in(0,1)$ , if for all $s$ -sparse ${\mathbf{u}}\in\mathbb{R}^{n}$

$$(1-\delta_{s})\|\mathbf{u}\|_{2}\leq\|\mathbf{A}\mathbf{u}\|_{2}\leq(1+\delta_{s})\|\mathbf{u}\|_{2}.$$

Remark 2.3.

For a detailed treatment of RIP, and measurement operators that fulfill it, we refer the reader to [14]. Let us only mention that if the entries of ${\mathbf{A}}$ are i.i.d. copies of a Gaussian random variable with mean zero and variance $\frac{1}{m}$ , then

$$m\geq C\delta_{s}^{-2}\,s\log\Big(\frac{en}{s}\Big)$$

measurements suffice to have an $s$ -RIP with constant $\delta_{s}>0$ with high probability, for an absolute constant $C>0$ . Consequently, $\delta_{s}={\cal O}\big{(}m^{-1/2}\sqrt{s\log(en/s)}\big{)}$ with high probability.
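As a rough illustration of Definition 2.2, one can sample random $s$-sparse vectors and look at the ratio $\|\mathbf{A}\mathbf{u}\|_{2}/\|\mathbf{u}\|_{2}$. This is only a sketch: certifying the RIP would require checking all supports, so sampling yields merely a lower bound on $\delta_{s}$. The helper `sampled_isometry_ratios` and all sizes are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, s = 200, 400, 5   # illustrative sizes (assumption)
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

def sampled_isometry_ratios(A, s, trials, rng):
    """Return ||A u||_2 / ||u||_2 for randomly drawn s-sparse vectors u."""
    n = A.shape[1]
    ratios = np.empty(trials)
    for t in range(trials):
        support = rng.choice(n, size=s, replace=False)
        u = np.zeros(n)
        u[support] = rng.normal(size=s)
        ratios[t] = np.linalg.norm(A @ u) / np.linalg.norm(u)
    return ratios

ratios = sampled_isometry_ratios(A, s, trials=500, rng=rng)

# Lower bound on the RIP constant implied by the sampled vectors only.
delta_lower = max(1.0 - ratios.min(), ratios.max() - 1.0)
print(f"ratios in [{ratios.min():.3f}, {ratios.max():.3f}], delta_s >= {delta_lower:.3f}")
```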

2.1 Augmented Formulation

It was observed in [16] that for $q=1$, the multi-penalty problem (2) reduces to a single-penalty regularization in which the measurement matrix and the datum are adjusted by the regularization parameter $\beta$. We include this result, extended to $0<q\leq 1$, together with the proof (see Section A.1), which is analogous to [16, Lemma 1].

Lemma 2.4.

The pair $({{\mathbf{u}}_{\alpha,\beta}^{q}},{{\mathbf{v}}_{\alpha,\beta}^{q}})$ minimizes ${\cal T}_{\alpha,\beta}^{q}$ in (4) if and only if

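$$\mathbf{v}_{\alpha,\beta}^{q}=\mathbf{v}(\mathbf{u}_{\alpha,\beta}^{q})=(\beta\,\mathbf{Id}_{n}+\mathbf{A}^{\top}\mathbf{A})^{-1}(\mathbf{A}^{\top}\mathbf{y}-\mathbf{A}^{\top}\mathbf{A}\mathbf{u}_{\alpha,\beta}^{q}), \tag{6}$$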

and ${{\mathbf{u}}_{\alpha,\beta}^{q}}$ is the solution of the augmented problem

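$$\mathbf{u}_{\alpha,\beta}^{q}\in\operatorname*{argmin}_{\mathbf{u}\in\mathbb{R}^{n}}\ \mathcal{F}_{\beta}(\mathbf{u}),\qquad\mathcal{F}_{\beta}(\mathbf{u}):=\frac{1}{2}\|\mathbf{B}_{\beta}\mathbf{u}-\mathbf{y}_{\beta}\|_{2}^{2}+\frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}, \tag{7}$$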

with

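$$\mathbf{B}_{\beta}=\Big(\mathbf{Id}_{m}+\frac{\mathbf{A}\mathbf{A}^{\top}}{\beta}\Big)^{-1/2}\mathbf{A}\quad\text{and}\quad\mathbf{y}_{\beta}=\Big(\mathbf{Id}_{m}+\frac{\mathbf{A}\mathbf{A}^{\top}}{\beta}\Big)^{-1/2}\mathbf{y}.$$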
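The reduction in Lemma 2.4 can be sanity-checked numerically: for any fixed $\mathbf{u}$, inserting $\mathbf{v}(\mathbf{u})$ from (6) into the smooth part of ${\cal T}_{\alpha,\beta}^{q}$ should reproduce the augmented fidelity term $\frac{1}{2}\|\mathbf{B}_{\beta}\mathbf{u}-\mathbf{y}_{\beta}\|_{2}^{2}$. The sketch below assumes random test data and computes the inverse matrix square root through an eigendecomposition, which is one of several possible choices and not necessarily the one used by the authors.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, beta = 20, 50, 0.5   # illustrative sizes and parameter (assumption)
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
y = rng.normal(size=m)
u = rng.normal(size=n)     # arbitrary test point; no optimality assumed

# Closed-form v(u) from (6).
v = np.linalg.solve(beta * np.eye(n) + A.T @ A, A.T @ (y - A @ u))

# S = (Id_m + A A^T / beta)^{-1/2} via the eigendecomposition of a symmetric
# positive definite matrix.
evals, evecs = np.linalg.eigh(np.eye(m) + (A @ A.T) / beta)
S = (evecs / np.sqrt(evals)) @ evecs.T
B_beta, y_beta = S @ A, S @ y

# Smooth part of T(u, v(u)) versus the fidelity term of F_beta(u).
lhs = 0.5 * np.linalg.norm(A @ (u + v) - y) ** 2 + 0.5 * beta * np.linalg.norm(v) ** 2
rhs = 0.5 * np.linalg.norm(B_beta @ u - y_beta) ** 2
print(lhs, rhs)
```

Since the $\ell_{q}$ term does not depend on $\mathbf{v}$, agreement of the two smooth parts is what makes minimizing ${\cal F}_{\beta}$ equivalent to minimizing ${\cal T}_{\alpha,\beta}^{q}$.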

Remark 2.5.

The noise folding forward model (1) is written in [2] in the whitened form $\tilde{\mathbf{y}}=\mathbf{B}\mathbf{u}^{\dagger}+\boldsymbol{\eta}$, with $\tilde{\mathbf{y}}=\mathbf{Q}^{-1/2}\mathbf{y}$, $\mathbf{B}=\mathbf{Q}^{-1/2}\mathbf{A}$, $\boldsymbol{\eta}=\mathbf{Q}^{-1/2}(\mathbf{A}\mathbf{v}+\boldsymbol{\xi})$, where $\mathbf{Q}=\frac{1}{c}(\sigma^{2}\mathbf{Id}_{m}+\sigma_{v}^{2}\mathbf{A}\mathbf{A}^{\top})$ and $c>0$ is a constant. Notice that this is closely related to the augmented problem in (7).

On an unrelated note, improving on the analysis in [2, Proposition 2] one can show (see Lemma B.1) that the coherence, defined for a matrix $\mathbf{M}$ as

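$$\mathrm{coh}(\mathbf{M})=\max_{i\neq j}\frac{\mathbf{m}_{i}^{\top}\mathbf{m}_{j}}{\|\mathbf{m}_{i}\|_{2}\|\mathbf{m}_{j}\|_{2}},$$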

where ${\mathbf{m}}_{i}$ is the $i$ -th column of ${\mathbf{M}}$ , of the augmented measurement matrix ${\mathbf{B}}_{\beta}$ satisfies

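$$\mathrm{coh}(\mathbf{B}_{\beta})\leq\Big(1+\frac{\|\mathbf{A}\|^{2}}{\beta}\Big)\Big(\mathrm{coh}(\mathbf{A})+\frac{\|\mathbf{A}\|^{2}}{\beta}\Big). \tag{8}$$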

In compressed sensing literature, the magnitude of the coherence of a matrix is an important measure of quality for measurement matrices, cf. [14, Section 5]. The bound in (8) thus suggests that for small $\left\|\mathbf{A}\right\|$ or large $\beta$, the linear measurement process modelled by $\mathbf{B}_{\beta}$ is as information preserving as the one modelled by $\mathbf{A}$. In addition, Lemma B.2 shows that $\mathrm{coh}(\mathbf{B}_{\beta})$ behaves like the coherence of a conditioned version of $\mathbf{A}$ as $\beta\rightarrow 0$. Let us mention that in practice $\mathrm{coh}(\mathbf{B}_{\beta})$ behaves well for all values of $\beta$, and even for moderate values of $\|\mathbf{A}\mathbf{A}^{\top}\|$.
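The bound (8) can be checked on a random example. The sketch below follows the coherence formula exactly as stated above, i.e. without absolute values; switching to the more common absolute-value convention would be a one-line change. The function names `coherence` and `augmented_matrix`, the matrix sizes, and the tested values of $\beta$ are assumptions made for illustration.

```python
import numpy as np

def coherence(M):
    """max_{i != j} m_i^T m_j / (||m_i|| ||m_j||), as in the formula above."""
    G = M.T @ M
    norms = np.sqrt(np.diag(G))
    C = G / np.outer(norms, norms)
    np.fill_diagonal(C, -np.inf)   # exclude the pairs i == j
    return C.max()

def augmented_matrix(A, beta):
    """B_beta = (Id_m + A A^T / beta)^{-1/2} A via an eigendecomposition."""
    m = A.shape[0]
    evals, evecs = np.linalg.eigh(np.eye(m) + (A @ A.T) / beta)
    return (evecs / np.sqrt(evals)) @ evecs.T @ A

rng = np.random.default_rng(3)
m, n = 40, 120
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
spec_sq = np.linalg.norm(A, 2) ** 2   # squared spectral norm of A

for beta in (1.0, 10.0, 100.0, 1000.0):
    lhs = coherence(augmented_matrix(A, beta))
    rhs = (1 + spec_sq / beta) * (coherence(A) + spec_sq / beta)
    print(f"beta = {beta:7.1f}: coh(B_beta) = {lhs:.4f} <= bound = {rhs:.4f}")
```

For large $\beta$ the bound tightens towards $\mathrm{coh}(\mathbf{A})$, in line with the remark that $\mathbf{B}_{\beta}$ is then about as information preserving as $\mathbf{A}$.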

By Lemma 2.4, to estimate the solution pair $(\mathbf{u}_{\alpha,\beta}^{q},\mathbf{v}_{\alpha,\beta}^{q})$ it is sufficient to first solve (7), and then insert the computed solution into (6). Since the fidelity term $\frac{1}{2}\|\mathbf{B}_{\beta}\mathbf{u}-\mathbf{y}_{\beta}\|_{2}^{2}$ is smooth and the regularization term $\left\|\mathbf{u}\right\|_{q}^{q}$ is non-convex, the common approach is to use iterative thresholding through a forward-backward splitting algorithm [9, 4]. For $\mathcal{F}_{\beta}$ and the augmented problem (7), the resulting thresholding iterations read

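$$\begin{cases}\text{Set the initial vector }\mathbf{u}^{0},\\ \mathbf{u}^{k+1}=\operatorname{prox}_{\mu,\frac{\alpha}{q}\|\cdot\|_{q}^{q}}\big(\mathbf{u}^{k}-\mu\mathbf{B}_{\beta}^{\top}(\mathbf{B}_{\beta}\mathbf{u}^{k}-\mathbf{y}_{\beta})\big).\end{cases} \tag{9}$$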

Each iteration in (9) can be viewed as a thresholded Landweber iteration; we first perform a step in the direction of the negative gradient of the data fidelity term, and then apply the proximal operator of the remaining non-convex term.

The proximal operator of a function $\Psi:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is defined by

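$$\operatorname{prox}_{\mu,\nu\Psi}(\mathbf{u})=\operatorname*{argmin}_{\mathbf{z}\in\mathbb{R}^{n}}\ \frac{1}{2\mu}\|\mathbf{z}-\mathbf{u}\|_{2}^{2}+\nu\Psi(\mathbf{z}), \tag{10}$$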

where $\mu,\nu>0$ . For separable mappings (10) can be applied component-wise, and we have $\operatorname{prox}_{\mu,\nu\|{\cdot}\|_{q}^{q}}({\mathbf{u}})=\left(\operatorname{prox}_{\mu,\nu|\cdot|^{q}}(\mathsf{u}_{i})\right)_{i=1}^{n}.$ In the general case, the proximal operator (10) could be set-valued, since there might be multiple or even no minima. It can be shown though that for $0<q<1$ the (one-dimensional) proximal operator of $\left|\cdot\right|^{q}$ satisfies

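[Equation (11): piecewise thresholding formula for $\operatorname{prox}_{\mu,\nu|\cdot|^{q}}(\mathsf{u})$ with threshold $\tau_{\mu}$; the right-hand side did not survive extraction.]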

The range of $\operatorname{prox}_{\mu,\nu\left|\cdot\right|^{q}}$ is $(-\infty,-\lambda_{\mu,q}]\cup\{0\}\cup[\lambda_{\mu,q},\infty)$ where $\lambda_{\mu,q}=\left(2\mu\nu(1-q)\right)^{\frac{1}{2-q}}$, see [9, Lemma 5.1], and it is discontinuous with a jump discontinuity² at $|\mathsf{u}|=\tau_{\mu}$. Note that the proximal operators in (11) are indeed thresholding operators, and as $q$ goes from $0$ to $1$ they interpolate between hard- and soft-thresholding operators. Moreover, a closed form of the operator $\operatorname{prox}_{\mu,\nu|\cdot|^{q}}$ is known only in special cases, namely for $q=1/2$ and $q=2/3$ [26].

² While the actual proximal operator of $\left|\cdot\right|^{q}$ is set-valued and simultaneously assumes both possible values at $\left|\mathsf{u}\right|=\tau_{\mu}$, we follow common practice and restrict the operator to zero at $\left|\mathsf{u}\right|=\tau_{\mu}$ to obtain a single-valued function.
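Because a closed form is available only for special values of $q$, the scalar proximal operator can be evaluated numerically for general $0<q<1$. The sketch below does not use the paper's formula (11); it is an assumed alternative that compares the objective of definition (10) at $z=0$ with its value at the largest stationary point, which is located by bisection. The name `prox_lq` and the parameter values are hypothetical.

```python
import numpy as np

def prox_lq(x, mu, nu, q, iters=80):
    """Elementwise argmin_z (z - x)^2 / (2 mu) + nu |z|^q for 0 < q < 1.

    Ties between z = 0 and the nonzero candidate are resolved to 0.
    """
    x = np.asarray(x, dtype=float)
    a = np.abs(x)
    c = mu * nu * q
    # phi(z) = z + c z^(q-1) is increasing on [z_min, inf); stationary points solve phi(z) = a.
    z_min = (c * (1.0 - q)) ** (1.0 / (2.0 - q))
    has_root = a >= z_min + c * z_min ** (q - 1.0)
    lo = np.full_like(a, z_min)
    hi = np.maximum(a, z_min)
    for _ in range(iters):   # bisection on the increasing branch of phi
        mid = 0.5 * (lo + hi)
        below = mid + c * mid ** (q - 1.0) < a
        lo = np.where(below, mid, lo)
        hi = np.where(below, hi, mid)
    z = 0.5 * (lo + hi)
    # Keep the nonzero candidate only if it strictly beats z = 0.
    wins = has_root & ((z - a) ** 2 / (2 * mu) + nu * z ** q < a ** 2 / (2 * mu))
    return np.sign(x) * np.where(wins, z, 0.0)

mu, nu, q = 1.0, 1.0, 0.5                        # illustrative parameters (assumption)
lam = (2 * mu * nu * (1 - q)) ** (1 / (2 - q))   # jump size lambda_{mu,q} from the text
xs = np.linspace(-3, 3, 13)
print(np.round(prox_lq(xs, mu, nu, q), 4))
```

The output is zero for small inputs and jumps to magnitudes of at least $\lambda_{\mu,q}$ once the input is large enough, which is the thresholding behaviour described above.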

It follows easily that if the step-size $\mu>0$ is small enough (smaller than $\|\mathbf{B}_{\beta}\|^{-2}$), the difference of successive iterates in (9) vanishes, i.e. $\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|_{2}\rightarrow 0$ as $k\rightarrow\infty$, see [9, Proposition 2.1]. Note that the iterations in (9) are quite different from those given by alternating minimization, where for each $k$ we need to compute $\mathbf{u}^{k+1}$ through iterative thresholding. The following lemma makes this more precise; it shows that (9) is equivalent to performing only the first step of iterative thresholding when computing $\mathbf{u}^{k+1}$ in (3). The proof can be found in Section A.2.

Lemma 2.6.

The iterations defined in (9) can be rewritten as

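$$\mathbf{u}^{k+1}=\operatorname{prox}_{\mu,\frac{\alpha}{q}\|\cdot\|_{q}^{q}}\big(\mathbf{u}^{k}-\mu\mathbf{A}^{\top}(\mathbf{A}\mathbf{u}^{k}+\mathbf{A}\mathbf{v}(\mathbf{u}^{k})-\mathbf{y})\big),$$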

which corresponds to a single proximal gradient descent step of (3) starting at ${\mathbf{u}}^{k}$ .
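The identity behind Lemma 2.6 is that the gradient of the augmented fidelity term coincides with the gradient of the original fidelity term evaluated at $\mathbf{v}(\mathbf{u})$. A minimal numerical check, assuming random test data and using $\mathbf{B}_{\beta}^{\top}\mathbf{B}_{\beta}=\mathbf{A}^{\top}(\mathbf{Id}_{m}+\mathbf{A}\mathbf{A}^{\top}/\beta)^{-1}\mathbf{A}$ to avoid forming the matrix square root, could look as follows.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, beta = 15, 40, 2.0   # illustrative sizes and parameter (assumption)
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
y = rng.normal(size=m)
u = rng.normal(size=n)

# Gradient of the augmented fidelity term, B_beta^T (B_beta u - y_beta),
# written with S^2 = (Id_m + A A^T / beta)^{-1}.
S2 = np.linalg.inv(np.eye(m) + (A @ A.T) / beta)
grad_augmented = A.T @ (S2 @ (A @ u - y))

# Gradient in the form used in Lemma 2.6, with v(u) defined as in (6).
v_u = np.linalg.solve(beta * np.eye(n) + A.T @ A, A.T @ (y - A @ u))
grad_lemma = A.T @ (A @ u + A @ v_u - y)

print(np.max(np.abs(grad_augmented - grad_lemma)))
```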

2.1.1 Linear Convergence

We now show that the iterates in (9) converge at a linear rate to stationary points $\mathbf{u}^{\star}$ of ${\cal T}_{\alpha,\beta}^{q}$, i.e. points such that $\boldsymbol{0}\in\partial{\cal T}_{\alpha,\beta}^{q}(\mathbf{u}^{\star})$, and characterize the convergence constant in terms of the design parameters. Let us emphasize that since our analysis is tailored to $\ell_{q}$-regularization we derive more explicit guarantees (in terms of the involved parameters) than what would follow by directly applying the more general statements of [28] to the augmented formulation (7). The proof can be found in Section A.3.

Theorem 2.7.

Let $\alpha,\beta>0$ and $0<q\leq 1$. Assume the matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ has RIP of order $s$ with a constant $\delta_{s}\in(0,1)$, and let the stepsize $\mu$ satisfy $0<\mu<\|\mathbf{A}\|^{-2}+\beta^{-1}$. Moreover, assume³ $\mathbf{u}^{\star}\in\mathbb{R}^{n}$ is such that $|\mathrm{supp}(\mathbf{u}^{\star})|\leq s$ and the iterates (9) satisfy $\mathbf{u}^{k}\rightarrow\mathbf{u}^{\star}$. Define $I=\mathrm{supp}(\mathbf{u}^{\star})$ and $d_{\min}=\min_{i\in I}|\mathsf{u}^{\star}_{i}|$. Then there exists $k_{0}\in\mathbb{N}$ such that for all $k\geq k_{0}$ we have

$$\|\mathbf{u}^{k+1}-\mathbf{u}^{\star}\|_{2}\leq\frac{1-\mu\big(1+\frac{\|\mathbf{A}\|^{2}}{\beta}\big)^{-1}(1-\delta_{s})^{2}}{1-\mu\alpha(1-q)\big(\frac{d_{\min}}{2}\big)^{q-2}}\,\|\mathbf{u}^{k}-\mathbf{u}^{\star}\|_{2}.$$

³ The sequence $\mathbf{u}^{k}$ provably converges to a stationary point since ${\cal T}_{\alpha,\beta}^{q}$ is, among other things, coercive and has the KŁ property, cf. [5, Theorem 5.1]. The assumption is thus not about whether $\mathbf{u}^{k}$ converges but about the specific limit point, which mainly depends on the concrete choice of initialization.

Remark 2.8.

(i) To have linear convergence in Theorem 2.7, we have to choose $\alpha$ such that

$$0<\alpha<\alpha^{\star}=\Big(1+\frac{\|\mathbf{A}\|^{2}}{\beta}\Big)^{-1}\frac{(1-\delta_{s})^{2}}{(1-q)}\Big(\frac{d_{\min}}{2}\Big)^{2-q}.$$

This resembles basic assumptions of the main result in [28]. One should thus interpret Theorem 2.7 as an additional refinement, better capable of predicting numerical behavior.

(ii) Theorem 2.7 suggests that the convergence constant depends on the sparsity of the signal and properties of $\mathbf{A}$. Namely, if the signal is sparser (and thus $\delta_{s}$ smaller) then the convergence constant decreases. Similarly, the constant decreases if we increase the number of measurements.

(iii) Assuming $\alpha=c\alpha^{\star}$, for $c\in(0,1)$, it is straight-forward to check that the rate in Theorem 2.7 becomes minimal by choosing $\mu\approx\|\mathbf{A}\|^{-2}+\beta^{-1}$. In this case the result transforms into

$$\|\mathbf{u}^{k+1}-\mathbf{u}^{\star}\|_{2}\leq\frac{1-\|\mathbf{A}\|^{-2}(1-\delta_{s})^{2}}{1-c\,\|\mathbf{A}\|^{-2}(1-\delta_{s})^{2}}\,\|\mathbf{u}^{k}-\mathbf{u}^{\star}\|_{2}.$$

(iv) Since $\alpha$ and $\beta$ control the strength of regularization in ${\cal T}_{\alpha,\beta}^{q}$, their choice depends on the expected noise level. Consequently, when setting $\alpha$ and $\beta$ one needs to make a trade-off between their regularizing effect and the desired convergence speed.
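The simplification in Remark 2.8 (iii) can be verified by direct evaluation of the formulas. The sketch below is only an arithmetic check; the function names `alpha_star` and `rate_augmented` and all numerical values are hypothetical, and the step size is set to the limiting value $\|\mathbf{A}\|^{-2}+\beta^{-1}$, which Theorem 2.7 itself only allows to be approached.

```python
def alpha_star(norm_A, beta, delta_s, q, d_min):
    """Upper bound alpha* from Remark 2.8 (i); requires q < 1."""
    return (1 + norm_A**2 / beta) ** -1 * (1 - delta_s) ** 2 / (1 - q) * (d_min / 2) ** (2 - q)

def rate_augmented(mu, alpha, norm_A, beta, delta_s, q, d_min):
    """Convergence constant from Theorem 2.7."""
    num = 1 - mu * (1 + norm_A**2 / beta) ** -1 * (1 - delta_s) ** 2
    den = 1 - mu * alpha * (1 - q) * (d_min / 2) ** (q - 2)
    return num / den

# Illustrative values (assumptions, not taken from the paper's experiments).
norm_A, beta, delta_s, q, d_min, c = 2.0, 1.0, 0.3, 0.5, 1.0, 0.5

a_star = alpha_star(norm_A, beta, delta_s, q, d_min)
mu = norm_A**-2 + 1 / beta   # limiting step size from Remark 2.8 (iii)
rate = rate_augmented(mu, c * a_star, norm_A, beta, delta_s, q, d_min)
simplified = (1 - norm_A**-2 * (1 - delta_s) ** 2) / (1 - c * norm_A**-2 * (1 - delta_s) ** 2)

print(f"alpha* = {a_star:.4f}, rate = {rate:.4f}, simplified = {simplified:.4f}")
```

Choosing $\alpha=\alpha^{\star}$ makes the numerator and denominator coincide, so the constant becomes one and linear convergence is lost, which is why (i) demands $\alpha<\alpha^{\star}$.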

2.1.2 Computational Complexity

Once $\mathbf{B}_{\beta}$ has been computed, executing (9) for a constant number of iterations costs ${\cal O}(mn)$ operations: ${\cal O}(mn)$ for matrix-vector products and ${\cal O}(n)$ for evaluating the proximal operator. But this is dominated by the operations needed to obtain $\mathbf{B}_{\beta}$, which involve a matrix square root and a matrix-matrix linear system and have to be done in advance. This turns out to be a computational bottleneck as soon as $m\geq n^{\frac{1}{\rho-1}}$ as it requires ${\cal O}(m^{\rho})$ operations, where $\rho\in[2.37,3]$ depends on the algorithm used [11]. Such a computational cost can be prohibitive for high-dimensional applications.
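The crossover point quoted above follows from comparing the two costs, as the following short calculation sketches (the exponents $\rho=3$ and $\rho=2.37$ are simply the endpoints of the range stated in the text):

$$m^{\rho}\geq mn\iff m^{\rho-1}\geq n\iff m\geq n^{\frac{1}{\rho-1}}.$$

For $\rho=3$ this gives $m\geq\sqrt{n}$, while for $\rho=2.37$ it gives $m\geq n^{1/1.37}\approx n^{0.73}$.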

2.2 Infimal Convolution Formulation

To overcome the computational limitations observed above, we consider an alternative approach. Define a new program by

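$$\mathbf{w}_{\alpha,\beta}^{q}=\operatorname*{argmin}_{\mathbf{w}\in\mathbb{R}^{n}}\ \frac{1}{2}\|\mathbf{A}\mathbf{w}-\mathbf{y}\|_{2}^{2}+\Big(\frac{\alpha}{q}\|\cdot\|_{q}^{q}\,\Delta\,\frac{\beta}{2}\|\cdot\|_{2}^{2}\Big)(\mathbf{w}), \tag{13}$$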

where the infimal convolution is given by

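$$g(\mathbf{w}):=\Big(\frac{\alpha}{q}\|\cdot\|_{q}^{q}\,\Delta\,\frac{\beta}{2}\|\cdot\|_{2}^{2}\Big)(\mathbf{w})=\inf_{\mathbf{u}\in\mathbb{R}^{n}}\ \frac{\alpha}{q}\|\mathbf{u}\|_{q}^{q}+\frac{\beta}{2}\|\mathbf{w}-\mathbf{u}\|_{2}^{2}. \tag{14}$$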

For a detailed treatment of infimal convolution and its properties, see [6]. It is straight-forward to check that an equivalence between minimizing (4) and (13) holds.

Lemma 2.9.

The pair $(\mathbf{u}_{\alpha,\beta}^{q},\mathbf{v}_{\alpha,\beta}^{q})$ minimizes ${\cal T}_{\alpha,\beta}^{q}$ in (4) if and only if $\mathbf{u}_{\alpha,\beta}^{q}+\mathbf{v}_{\alpha,\beta}^{q}$ solves (13) while $\mathbf{u}_{\alpha,\beta}^{q}$ attains the infimal value of $\left(\frac{\alpha}{q}\|\cdot\|_{q}^{q}\,\Delta\,\frac{\beta}{2}\|\cdot\|_{2}^{2}\right)(\mathbf{u}_{\alpha,\beta}^{q}+\mathbf{v}_{\alpha,\beta}^{q})$.

In order to solve (13) via iterative thresholding (i.e. proximal gradient descent), we need to efficiently evaluate the proximal operator of (14). A helpful observation is that (14) can be interpreted as the Moreau-envelope of $\|{\cdot}\|_{q}^{q}$ , which for a function $f$ and $t>0$ is defined as

[TABLE]

where the last equality only holds if $\operatorname{prox}_{t,f}({\mathbf{x}})\neq\emptyset$. It has been observed in [7, Theorem 6.63] that computing the proximal operator of the Moreau envelope reduces to computing the proximal operator of the underlying function. Though stated only for convex functions in [7], it is straightforward to generalize the result.

Lemma 2.10.

Let $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}$ be a lower semi-continuous function with $f(0)=\min f$ . Then,

[TABLE]

The proof is in Section A.4. Define now the proximal gradient descent for (13) by

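$${\mathbf{w}}^{k+1}=\operatorname{prox}_{\mu,g}\left({\mathbf{w}}^{k}-\mu{\mathbf{A}}^{\top}({\mathbf{A}}{\mathbf{w}}^{k}-{\mathbf{y}})\right).\qquad(15)$$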

We denote by ${\mathbf{u}}^{k}=\operatorname{prox}_{\frac{1}{\beta},\frac{\alpha}{q}\|{\cdot}\|_{q}^{q}}({\mathbf{w}}^{k})$ the sequence of minimizers attaining $g({\mathbf{w}}^{k})$ , and set ${\mathbf{v}}^{k}={\mathbf{w}}^{k}-{\mathbf{u}}^{k}$ . Note that with this notation ${\mathbf{w}}^{k}$ and ${\mathbf{u}}^{k}$ can also be characterized via

[TABLE]

Unlike (15), the representation in (16) does not yield a practically viable algorithm, since ${\mathbf{w}}^{k}$ and ${\mathbf{u}}^{k}$ are not decoupled. It does though lend itself to theoretical analysis of the iterations, cf. Section A.5.
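To illustrate how the iteration (15) can be implemented, the following sketch treats the convex case $q=1$ only, where the identity from [7, Theorem 6.63] applies directly and the proximal operator of $\alpha\|\cdot\|_{1}$ is soft thresholding. The function names are illustrative assumptions rather than the authors' code, and the case $q<1$ would require replacing `soft_threshold` by a thresholding operator for $\frac{\alpha}{q}\|\cdot\|_{q}^{q}$, which is not shown here.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1 (entry-wise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def g_l1(w, alpha, beta):
    """g(w) = inf_u alpha*||u||_1 + (beta/2)*||w - u||_2^2, i.e. (14) for q = 1."""
    u = soft_threshold(w, alpha / beta)  # minimizer u attaining g(w)
    return alpha * np.abs(u).sum() + 0.5 * beta * np.sum((w - u) ** 2)


def prox_g_l1(x, mu, alpha, beta):
    """prox of mu*g, where g is the Moreau envelope of alpha*||.||_1 with t = 1/beta.

    Uses the convex-case identity of [7, Theorem 6.63]: a convex combination
    of x and a single soft-thresholding step with enlarged parameter.
    """
    t = 1.0 / beta
    return x + mu / (t + mu) * (soft_threshold(x, (t + mu) * alpha) - x)


def infimal_convolution_pgd(A, y, alpha, beta, mu, n_iter=200):
    """Proximal gradient descent on (13) for q = 1; requires mu < ||A||^{-2}."""
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = prox_g_l1(w - mu * A.T @ (A @ w - y), mu, alpha, beta)
    u = soft_threshold(w, alpha / beta)  # sparse component
    v = w - u                            # remaining (noise) component
    return w, u, v
```

Each iteration uses two matrix-vector products and one entry-wise thresholding, so no matrix has to be precomputed; the decoupling of ${\mathbf{w}}^{k}$ and ${\mathbf{u}}^{k}$ is what makes this form usable in practice.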

2.2.1 Linear Convergence

Though $g$ in (14) is continuous and separable, i.e. $g({\mathbf{w}})=\sum_{i=1}^{n}g_{i}(\mathsf{w}_{i})$, it is not continuously differentiable, so we cannot apply [28] to deduce linear convergence of (15). Nevertheless, using the KKT-conditions of the objective functions in (16), we get linear convergence of the iterates in (15) by a strategy similar to the one in Theorem 2.7.

Theorem 2.11.

Let $\alpha,\beta>0$ and $0<q\leq 1$. Assume that $0<\mu<\|{\mathbf{A}}\|^{-2}$ and ${\mathbf{w}}^{k}\rightarrow{\mathbf{w}}^{\star}$. (Footnote 4: Along the lines of Footnote 3 in Theorem 2.7. Just note that $g$ in (14) has the KL-property by [27, Theorem 3.1] and, hence, the objective function in (13) has it as well.) Let $I\subset[n]$ denote the support of ${\mathbf{u}}^{\star}=\operatorname{prox}_{\frac{1}{\beta},\frac{\alpha}{q}\|{\cdot}\|_{q}^{q}}({\mathbf{w}}^{\star})$ and define ${d_{\min}}=\min_{i\in I}\left|\mathsf{u}_{i}^{\star}\right|$. Then there exists $k_{0}\in\mathbb{N}$ such that for all $k\geq k_{0}$ we have

[TABLE]

The proof of Theorem 2.11 is given in Section A.5.

Remark 2.12.

On the one hand, in Theorem 2.11 the assumption on $\mu$ and the rate differ from Theorem 2.7; there is no influence of $\beta$ on admissible step-sizes and the rate is split in two distinct components. On the other hand, since, for $\mu<\|{\mathbf{A}}\|^{-2}$ ,

[TABLE]

the rate in Theorem 2.11 suggests choosing $\beta$ large so as to dominate the second term of the rate, in which case the assumptions on $\mu$ agree in both theorems. Moreover, this reduces the rate to

[TABLE]

where the denominator is as in Theorem 2.7. In light of (17), we get linear convergence of (15) if

[TABLE]

As already discussed in Remark 2.8, a trade-off between regularization and convergence rate has to be taken into account when choosing $\alpha$ and $\beta$ .

Remark 2.13.

For $q=1$ , an alternative viewpoint on (16) is given by

[TABLE]

where we used [22, Eq. (3.3)] in the last step, meaning that

[TABLE]

is a proximal gradient descent sequence of $\|\nabla M_{\frac{\alpha}{\beta}\left\|{\cdot}\right\|_{1}}(\cdot)\|_{2}^{2}$ , the squared $\ell_{2}$ -norm of the gradient of the smooth Moreau approximation of $\frac{\alpha}{\beta}\left\|{\cdot}\right\|_{1}$ . From this perspective, multi-penalty regularization resembles a Newton-type method by searching for zeros of the derivative of a smooth approximation of the $\ell_{1}$ -norm. However, transferring this intuition to the case $q<1$ is non-trivial. On a technical level the equations in (18) break down in the third line, which does not hold for $q<1$ due to non-convexity of $\|{\cdot}\|_{q}^{q}$ .

2.2.2 Computational Complexity

While (9) requires computing ${\mathbf{B}}_{\beta}$ , which can be costly, the infimal convolution formulation (15) does not incur additional computational costs and thus directly inherits efficiency and linear convergence of the proximal descent method. Indeed, for a fixed number of iterations the number of operations performed in (15) is ${\cal O}(mn)$ (the additional convex combination when evaluating the proximal operator by Lemma 2.10 is negligible). This is considerably lower than ${\cal O}(m^{\rho})$ , for $\rho\in[2.37,3]$ , which is the computational cost of the augmented formulation, particularly if $m$ is large. In numerical simulations, this effect is easy to observe, cf. Section 3.

3 Numerical Experiments

We now present experimental results that focus on two aspects of our study. First, we examine the convergence rate of the proposed algorithms, confirming linear convergence and, in the case of the augmented formulation, the dependence of the convergence constant on the parameters of the problem. Second, we examine their efficiency by studying the overall computational effort on larger-scale problems.

3.1 Convergence Rate

Via the RIP-constant $\delta_{s}$, Theorem 2.7 gives a direct dependence of the convergence rate on the sparsity of the solution and the properties of the matrix, whereas Theorem 2.11 is harder to interpret: it is straightforward to deduce the existence of parameter regimes in which linear convergence occurs, but hard to quantify the rate in terms of the parameters. While numerical evidence for linear convergence of the infimal convolution formulation is observed in Section 3.2, we continue by validating Theorem 2.7 in two experiments. In both, we take $q=1/2$, and add pre- and post-measurement Gaussian noise terms, ${\mathbf{v}}$ and $\boldsymbol{\mathrm{\xi}}$, with noise level $\frac{\|{{\mathbf{v}}}\|}{\|{{\mathbf{u}}^{\dagger}}\|}=\frac{\|{\boldsymbol{\mathrm{\xi}}}\|}{\|{{\mathbf{u}}^{\dagger}}\|}=0.1$. We choose an admissible $\alpha$ according to Remark 2.8 and tune it such that the reconstructed signal shares its support size with the ground-truth. Both illustrations in Figure 1 plot the relative error between the iterates ${\mathbf{u}}^{k}$ and the stationary point ${\mathbf{u}}^{\star}$ against the number of proximal gradient descent steps.
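The measurement model used in these experiments, ${\mathbf{y}}={\mathbf{A}}({\mathbf{u}}^{\dagger}+{\mathbf{v}})+\boldsymbol{\mathrm{\xi}}$ with both noise terms rescaled to a prescribed level, could be generated along the following lines. This is a sketch under assumptions that the text does not specify: the $1/\sqrt{m}$ column normalization of ${\mathbf{A}}$, the uniformly random support, the Gaussian non-zero entries, and the name `make_problem` are all illustrative choices.

```python
import numpy as np


def make_problem(m, n, s, noise_level=0.1, seed=0):
    """Draw A, an s-sparse u_true, and y = A (u_true + v) + xi.

    Both noise vectors are rescaled so that
    ||v|| / ||u_true|| = ||xi|| / ||u_true|| = noise_level.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n)) / np.sqrt(m)  # normalization is an assumption

    u_true = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)  # random support (assumption)
    u_true[support] = rng.standard_normal(s)

    def scaled_noise(size):
        z = rng.standard_normal(size)
        return noise_level * np.linalg.norm(u_true) * z / np.linalg.norm(z)

    v = scaled_noise(n)    # pre-measurement noise, lives in signal space
    xi = scaled_noise(m)   # post-measurement noise, lives in measurement space
    y = A @ (u_true + v) + xi
    return A, y, u_true, v, xi


# Dimensions of the first experiment: 200 x 600 Gaussian matrix, 20-sparse signal.
A, y, u_true, v, xi = make_problem(m=200, n=600, s=20)
```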

Varying the Penalty Parameter.

In the first experiment we take a Gaussian matrix ${\mathbf{A}}\in\mathbb{R}^{200\times 600}$, a $20$-sparse signal ${\mathbf{u}}^{\dagger}$, and vary $\beta$. Theorem 2.7 predicts that smaller values of $\beta$ allow larger step-sizes, though the convergence constants are (essentially) the same. This effect is readily observed in Figure 1(a). Note that for smaller $\beta$ the algorithm also reaches the steep part of the curve faster. This is because the convergence of the iterates is initially slow (until the support is identified) and larger step-sizes reduce the support size faster. The overall speed-up gained from a smaller $\beta$ can be up to two-fold in terms of the number of iterations needed to reach the desired accuracy level.

Varying the Measurements.

In the second experiment we consider a Gaussian matrix ${\mathbf{A}}\in\mathbb{R}^{m\times 600}$ , for $m\in\{100,200,300,400\}$ , and a $20$ -sparse signal ${\mathbf{u}}^{\dagger}$ . Varying the number of measurements changes the RIP of the measurement matrix (a larger $m$ decreases $\delta_{s}$ , see Remark 2.3), and per Theorem 2.7 should affect the convergence constant. Figure 1(b) shows exactly that. An analogous effect can be observed for different classes of measurement matrices, such as partial Toeplitz, or partial circulant matrices with Rademacher or Gaussian entries, but those results have not been included for the sake of brevity.

3.2 Computational Comparison

Iteration Count.

In order to provide numerical evidence for our initial statement that alternating minimization is highly sub-optimal, in Figure 2(a) we look at the decay of the relative error over the number of basic iterations, i.e. the number of thresholded gradient descent steps, of all three discussed approaches: alternating minimization (3), augmented formulation (9), and infimal convolution (15). In this experiment, we use a Gaussian matrix ${\mathbf{A}}\in\mathbb{R}^{100\times 500}$, the original signal is $14$-sparse, $q=1/2$, and the parameters $\alpha$, $\beta$, and $\mu$ are selected so that each method returns a $13$-sparse vector. The $x$-axis refers to the number of times the proximal operator is called, while the $y$-axis shows the relative error. The considerably worse performance of alternating minimization is due to the fact that it requires (too) many thresholded gradient steps to solve, for each $k\in\mathbb{N}$, the sub-problems for the ${\mathbf{u}}^{k}$ component up to the pre-fixed accuracy $\varepsilon=10^{-8}$. Thus, the algorithm performs hardly any alternating steps.

Computation Time.

To illustrate the differences between the augmented and the infimal convolution formulation in terms of computational complexity, we perform the following experiment. We set the parameters generically to $\alpha=0.02$, $\beta=0.2$, and $\mu=0.1$, and reconstruct a $100$-sparse signal ${\mathbf{u}}^{\dagger}\in\mathbb{R}^{5000}$ from measurements ${\mathbf{y}}\in\mathbb{R}^{m}$, for $m$ varying from $1000$ (sub-sampling) to $8000$ (over-sampling). We again take $q=1/2$, and add pre- and post-measurement noise terms, ${\mathbf{v}}$ and $\boldsymbol{\mathrm{\xi}}$, with noise level $0.1$. Averaging over $20$ random realizations of ${\mathbf{u}}^{\dagger}$, we record for the augmented approach (9) and the infimal convolution approach (15) the time needed to perform $50$ iterations. After so few iterations neither of the two algorithms has converged, though this already suffices to make a point regarding the computational cost, since both algorithms incur the same cost per iteration (i.e. the gap remains the same) in the remaining iterations. As Figure 2(b) shows, the additional computation of ${\mathbf{B}}_{\beta}$ in (9) causes a massive additional workload, which limits the applicability of the augmented approach in large-scale settings. In contrast, the infimal convolution formulation is hardly affected by the increase in the number of measurements. Though the augmented approach tends to converge in fewer iterations, cf. Figure 2(a), the additional iterations needed by the infimal convolution formulation to reach a comparable level of accuracy do not close the gap in computation time. Note that we do not include alternating minimization here since it requires many more iterations (in the sense of single thresholded gradient descent steps) to reach a reconstruction performance similar to that of the two proximal descent methods, and hence could not compete with them.

4 Discussion

In the present work we discussed the benefits of multi-penalty regularization for support recovery of signals when pre-measurement noise is amplified by the measurement operator, as well as the numerical challenges in solving the corresponding variational formulation. Since alternating minimization is sub-optimal for this task in terms of both computational efficiency and theoretical analysis, we proposed a novel reduction to single-penalty regularization based on infimal convolution, and compared this new approach to an existing reduction based on augmented formulations. Moreover, we established linear convergence for both single-penalty reductions and showed that our new approach avoids a computational bottleneck that is unavoidable in the augmented approach and that causes a significant additional computational workload as the number of measurements increases. There are several interesting open questions left for future work.

First, in Remark 2.13 we observed, for $q=1$ , a connection between the infimal convolution formulation and the proximal descent on the $\ell_{2}$ -norm of the gradient of a Moreau-regularized $\ell_{1}$ -functional. As we have not seen a comparable relation in the context of multi-penalty regularization so far, we are curious whether this observation can be extended to the case $0<q<1$ . If so, this might provide valuable insights into non-convex optimization.

Second, as the reader might have noticed, great parts of the arguments we used (support stabilization, sign stabilization, etc.) are not restricted to finite dimensions. In light of more general settings of multi-penalty regularization in [21] and single-penalty regularization in [9], it would be fruitful to transfer our findings to general separable Hilbert spaces as well.

Third, we mention that when using the infimal convolution based approach, in some experiments it was possible to choose $\mu$ much larger than suggested by Theorem 2.11, while still observing reliable convergence of the program. We wonder whether there is an alternative proof leading to a relaxed condition on $\mu$ resembling the assumption in Theorem 2.7.

Let us conclude by emphasizing that the infimal convolution formulation can as well be applied if regularizers other than the $\ell_{q}$ -norm are used in the multi-penalty problem, e.g. Smoothly Clipped Absolute Deviation (SCAD) [13], Minimax Concave Penalty (MCP) [29], and Log-Sum Penalty (LSP) [10]. In those cases the more general single-penalty rate analysis in [28] should prove useful as a tool.

Acknowledgment

ZK and VN acknowledge the support from RCN-funded FunDaHD project No 251149/O70. JM acknowledges the support of DFG-SPP 1798.

Appendix A Proofs

A.1 Proof of Lemma 2.4

For a fixed ${\mathbf{u}}$ the minimization of ${\cal T}_{\alpha,\beta}^{q}$ in (4) with respect to ${\mathbf{v}}$ reduces to Tikhonov minimization, and thus the solution satisfies

[TABLE]

Rewriting the above expression we have

[TABLE]

Plugging this expression into (4), the minimization problem for ${\mathbf{u}}$ is rewritten as

[TABLE]

The Woodbury identity for invertible matrices ${\mathbf{V}}\in\mathbb{R}^{m\times m}$ , ${\mathbf{W}}\in\mathbb{R}^{n\times n}$ and matrices ${\mathbf{M}}_{1}\in\mathbb{R}^{m\times n}$ , ${\mathbf{M}}_{2}\in\mathbb{R}^{n\times m}$ reads

[TABLE]

Using (19), this gives

[TABLE]

Plugging this expression back into ${\cal T}_{\alpha,\beta}^{q}({\mathbf{u}},v({\mathbf{u}}))$ , and extracting the square root, we have ${\cal T}_{\alpha,\beta}^{q}({\mathbf{u}},v({\mathbf{u}}))={\cal F}_{\beta}({\mathbf{u}})$ . Minimizing over ${\mathbf{u}}$ and using the following simple observation gives the conclusion.

Lemma A.1.

If ${{\mathbf{u}}_{\alpha,\beta}^{q}}$ is a local minimizer of (7), then the pair $({{\mathbf{u}}_{\alpha,\beta}^{q}},v({{\mathbf{u}}_{\alpha,\beta}^{q}}))$ with $v(u)$ defined in (6), is a local minimizer of ${\cal T}_{\alpha,\beta}^{q}$ in (4).

Proof.

Let ${{\mathbf{u}}_{\alpha,\beta}^{q}}$ be a local minimizer of (7) and assume there exists a sequence $({\mathbf{u}}^{k},{\mathbf{v}}^{k})\rightarrow({{\mathbf{u}}_{\alpha,\beta}^{q}},v({{\mathbf{u}}_{\alpha,\beta}^{q}}))$ such that $\mathcal{T}_{\alpha,\beta}^{q}({\mathbf{u}}^{k},{\mathbf{v}}^{k})<\mathcal{T}_{\alpha,\beta}^{q}({{\mathbf{u}}_{\alpha,\beta}^{q}},v({{\mathbf{u}}_{\alpha,\beta}^{q}}))$, for all $k\in\mathbb{N}$. We then have

[TABLE]

where the first inequality follows from the minimality of $v({\mathbf{u}}^{k})$. This contradicts the assumption that ${{\mathbf{u}}_{\alpha,\beta}^{q}}$ is a local minimizer of (7). ∎

A.2 Proof of Lemma 2.6

First note that

[TABLE]

while

[TABLE]

Hence, it suffices to show that

[TABLE]

Extracting ${\mathbf{A}}^{\top}$ from the left and using the Woodbury identity (20) with ${\mathbf{M}}_{1}={\mathbf{A}},{\mathbf{M}}_{2}={\mathbf{A}}^{\top}$ , ${\mathbf{W}}=\beta\mathsf{{\mathbf{I}}{{\mathbf{d}}}}_{n}$ , and ${\mathbf{V}}=\mathsf{{\mathbf{I}}{{\mathbf{d}}}}_{m}$ the conclusion follows.

A.3 Proof of Theorem 2.7

In order to prove Theorem 2.7, we have to control the eigenvalues of ${\mathbf{B}}_{\beta}^{\top}{\mathbf{B}}_{\beta}$ characterizing the growth of the data fidelity term in (7).

Lemma A.2.

For ${\mathbf{B}}_{\beta}\in\mathbb{R}^{m\times n}$ defined as in Lemma 2.4,

[TABLE]

is the Lipschitz-constant of the gradient of the augmented data-fidelity term $\frac{1}{2}\|{\mathbf{B}}_{\beta}{\mathbf{u}}-{\mathbf{y}}_{\beta}\|_{2}^{2}$ . Moreover, for any $I\subset[n]$ ,

[TABLE]

Proof.

Let ${\mathbf{A}}={\mathbf{U}}\mathbf{\Sigma}{\mathbf{V}}^{\top}$ denote the SVD of ${\mathbf{A}}$ . This gives

[TABLE]

so that $\|{\mathbf{B}}_{\beta}^{\top}{\mathbf{B}}_{\beta}\|=\left(\|{\mathbf{A}}\|^{-2}+\beta^{-1}\right)^{-1}.$ By (21), we have for any ${\mathbf{z}}\in\mathbb{R}^{n}$

[TABLE]

implying the second claim. ∎
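The first claim of Lemma A.2 can be checked numerically. The sketch below builds ${\mathbf{B}}_{\beta}={\mathbf{Q}}_{\beta}^{-1/2}{\mathbf{A}}$ with ${\mathbf{Q}}_{\beta}=\mathrm{Id}_{m}+{\mathbf{A}}{\mathbf{A}}^{\top}/\beta$ (the representation used in Appendix B) through a symmetric eigendecomposition and compares $\|{\mathbf{B}}_{\beta}^{\top}{\mathbf{B}}_{\beta}\|$ with $(\|{\mathbf{A}}\|^{-2}+\beta^{-1})^{-1}$. The function name and the test dimensions are illustrative assumptions; the eigendecomposition is one possible way to form the inverse square root, not necessarily the one used by the authors.

```python
import numpy as np


def augmented_matrix(A, beta):
    """B_beta = (Id_m + A A^T / beta)^(-1/2) A via a symmetric eigendecomposition."""
    m = A.shape[0]
    Q = np.eye(m) + (A @ A.T) / beta
    evals, evecs = np.linalg.eigh(Q)                  # Q is symmetric positive definite
    Q_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T   # V diag(evals^(-1/2)) V^T
    return Q_inv_sqrt @ A


rng = np.random.default_rng(0)
A = rng.standard_normal((30, 80))
beta = 0.5

B = augmented_matrix(A, beta)
lipschitz_numeric = np.linalg.norm(B.T @ B, 2)        # spectral norm of B^T B
lipschitz_formula = 1.0 / (np.linalg.norm(A, 2) ** -2 + 1.0 / beta)
print(lipschitz_numeric, lipschitz_formula)           # the two values should agree
```

The eigendecomposition of the $m\times m$ matrix ${\mathbf{Q}}_{\beta}$ is exactly the kind of precomputation whose cost grows super-quadratically in $m$, as discussed in Section 2.1.2.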

We can now show that all, up to finitely many, iterates $\left({\mathbf{u}}^{k}\right)_{k=1}^{\infty}$ generated by (9) share the same support and sign pattern. The proof is standard and follows [9].

Lemma A.3 (Support and sign recovery).

Assume $\beta>0$ , $0<q\leq 1$ , and $\mu<\|{\mathbf{A}}\|^{-2}+\beta^{-1}$ . Then the iterates $\left({\mathbf{u}}^{k}\right)_{k=1}^{\infty}$ satisfy $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}\rightarrow 0$ as $k\rightarrow\infty$ . Moreover, all iterates, up to finitely many, have the same support and sign pattern.

Proof.

Since $\mu<\|{\mathbf{A}}\|^{-2}+\beta^{-1}=\frac{1}{L}$ we have $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}\rightarrow 0$ as $k\rightarrow\infty$ by [9, Corollary 2.1]. Now, since the range of $\operatorname{prox}_{\mu,\lambda\psi}$ is $(-\infty,-\lambda_{\mu,q}]\cup\{0\}\cup[\lambda_{\mu,q},\infty)$, it follows that the absolute value of a non-zero entry of ${\mathbf{u}}^{k}$, for $k\geq 1$, is at least $\lambda_{\mu,q}$. Thus, if $\mathrm{supp}({\mathbf{u}}^{k+1})\neq\mathrm{supp}({\mathbf{u}}^{k})$ we have $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}\geq\lambda_{\mu,q}$, and analogously, if $\operatorname{sgn}({\mathbf{u}}^{k+1})\neq\operatorname{sgn}({\mathbf{u}}^{k})$ we have $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}\geq 2\lambda_{\mu,q}$. Thus, since $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}\rightarrow 0$ as $k\rightarrow\infty$, sign and support can change only finitely many times. ∎
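The gap in the range of the thresholding operator, on which the argument above rests, can be observed numerically. The sketch below evaluates a scalar proximal operator for $q=1/2$ by brute-force grid search, assuming the convention $\operatorname{prox}_{\mu,\lambda\psi}(x)=\arg\min_{z}\lambda\psi(z)+\frac{1}{2\mu}(z-x)^{2}$ with $\psi(z)=\frac{1}{q}|z|^{q}$; it is not a closed-form thresholding routine, and the function names, grid ranges, and parameter values are illustrative assumptions.

```python
import numpy as np


def brute_force_prox(x, mu, lam, q, grid):
    """Grid argmin of (lam/q)*|z|^q + (z - x)^2 / (2*mu) for a scalar x."""
    objective = (lam / q) * np.abs(grid) ** q + (grid - x) ** 2 / (2.0 * mu)
    return grid[np.argmin(objective)]


def smallest_nonzero_output(mu, lam, q, tol=1e-6):
    """Smallest magnitude among the non-zero prox outputs over a range of inputs."""
    grid = np.linspace(-6.0, 6.0, 12001)     # candidate minimizers, step 1e-3
    inputs = np.linspace(-5.0, 5.0, 401)     # points at which the prox is evaluated
    outputs = np.array([brute_force_prox(x, mu, lam, q, grid) for x in inputs])
    nonzero = np.abs(outputs[np.abs(outputs) > tol])
    return nonzero.min()


# Outputs are either (numerically) zero or bounded away from zero.
print(smallest_nonzero_output(mu=1.0, lam=1.0, q=0.5))
```

Because every non-zero output stays above a fixed positive level, a change of support between consecutive iterates forces a jump of at least that size, which is incompatible with $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}\rightarrow 0$.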

Proof of Theorem 2.7.

By Lemma A.3 there exists $k_{0}$ such that for all $k\geq k_{0}$ the support of ${\mathbf{u}}^{k}$ is finite, and the support and sign of ${\mathbf{u}}^{k}$ are equal to those of ${\mathbf{u}}^{\star}$. Thus, by [9, Proposition 2.3], ${\mathbf{u}}^{\star}$ is a fixed point of (9). Denote $I=\mathrm{supp}({\mathbf{u}}^{\star})$ with $|I|\leq s$. The definition of the proximal operator in (10) and the Karush-Kuhn-Tucker (KKT) conditions yield

[TABLE]

and

[TABLE]

Subtracting the two equations on the index set $I$ , and denoting $\psi({\mathbf{u}})=\frac{1}{q}\left\|{{\mathbf{u}}}\right\|_{q}^{q}$ , we have

[TABLE]

where $\psi^{\prime}({\mathbf{u}})=(\operatorname{sgn}(\mathsf{u}_{i})\left|{\mathsf{u}_{i}}\right|^{q-1})_{i\in[n]}$ is acting entry-wise. Note that since $k\geq k_{0}$ we have $\operatorname{sgn}({\mathbf{u}}^{\star}_{I})=\operatorname{sgn}({\mathbf{u}}^{k+1}_{I})$ and $\left\|{{\mathbf{u}}^{\star}-{\mathbf{u}}^{k}}\right\|_{2}=\|{\mathbf{u}}^{\star}_{I}-{\mathbf{u}}^{k}_{I}\|_{2}$ . A straightforward calculation gives

[TABLE]

where ${\mathbf{M}}={\mathbf{B}}_{\beta}^{\top}{\mathbf{B}}_{\beta}$. Taking the inner product of (23) with ${\mathbf{u}}_{I}^{k+1}-{\mathbf{u}}^{\star}_{I}$, and applying the Cauchy-Schwarz inequality, we get

[TABLE]

Since $\psi$ is twice differentiable, and ${\mathbf{u}}^{k+1}$ and ${\mathbf{u}}^{\star}$ have the same sign and support, we have for the second term

[TABLE]

where $C_{i}^{k+1}$ lies between $\mathsf{u}^{k+1}_{i}$ and $\mathsf{u}^{\star}_{i}$, and $\psi^{\prime\prime}(\mathsf{u})=(q-1)\left|\mathsf{u}\right|^{q-2}$. Since ${\mathbf{u}}^{k}\rightarrow{\mathbf{u}}^{\star}$, we may assume $k_{0}$ sufficiently large to guarantee $\left|\mathsf{u}_{i}^{k}\right|\geq\frac{1}{2}\left|\mathsf{u}_{i}^{\star}\right|$, for all $k\geq k_{0}$ and $i\in I$. Consequently,

[TABLE]

Thus,

[TABLE]

On the other hand, since $\mu<(\lambda_{\max}({\mathbf{M}}))^{-1}\leq(\lambda_{\min}({\mathbf{M}}_{I,I}))^{-1}$ , we have

[TABLE]

by Lemma A.2. Thus,

[TABLE]

Together with the RIP of ${\mathbf{A}}$ this yields the claim. ∎

A.4 Proof of Lemma 2.10

Let ${\mathbf{x}}\in\mathbb{R}^{n}$ be fixed and assume $f(0)=0$ without loss of generality. We have

[TABLE]

By $f$ being lower semi-continuous and bounded from below, we have

[TABLE]

implying $\operatorname{prox}_{\mu,\lambda M_{t,f}}({\mathbf{x}})\neq\emptyset$. Denote by ${\cal E}_{\tilde{{\mathbf{z}}}}=\left\{\theta{\mathbf{x}}+(1-\theta)\tilde{{\mathbf{z}}}\colon\theta\in[0,1]\right\}$ the line segment connecting ${\mathbf{x}}$ and $\tilde{{\mathbf{z}}}$. Since ${\cal E}_{\tilde{{\mathbf{z}}}}$ is convex, we have $h({\mathbf{P}}_{{\cal E}_{\tilde{{\mathbf{z}}}}}({\mathbf{z}}),\tilde{{\mathbf{z}}})\leq h({\mathbf{z}},\tilde{{\mathbf{z}}})$, for any ${\mathbf{z}},\tilde{{\mathbf{z}}}\in\mathbb{R}^{n}$, with equality if and only if ${\mathbf{P}}_{{\cal E}_{\tilde{{\mathbf{z}}}}}({\mathbf{z}})={\mathbf{z}}$. Consequently, if $({\mathbf{z}},\tilde{{\mathbf{z}}})$ solves the above program, we have ${\mathbf{z}}=\theta{\mathbf{x}}+(1-\theta)\tilde{{\mathbf{z}}}$ for some $\theta\in[0,1]$. Let us define

[TABLE]

By the above considerations we have

[TABLE]

where there is a one-to-one correspondence between solutions $({\mathbf{z}}^{\star},\tilde{{\mathbf{z}}}^{\star})$ of the left side and solutions $(\theta^{\star},\tilde{{\mathbf{z}}}^{\star})$ of the right side. Moreover, it follows easily that for $\tilde{{\mathbf{z}}}\in\mathbb{R}^{n}$ fixed,

[TABLE]

which is independent of $\tilde{{\mathbf{z}}}$ . Thus, the claim follows since

[TABLE]

A.5 Proof of Theorem 2.11

As in the proof of Theorem 2.7, the first step is to control support and signs of the iterates. Recall that, for ${\mathbf{w}}^{k}$ as in (15), we denote by ${\mathbf{u}}^{k}=\operatorname{prox}_{\frac{1}{\beta},\frac{\alpha}{q}\|{\cdot}\|_{q}^{q}}({\mathbf{w}}^{k})$ the sequence of minimizers attaining $g({\mathbf{w}}^{k})$ , by ${\mathbf{v}}^{k}={\mathbf{w}}^{k}-{\mathbf{u}}^{k}$ , and that by (16) we have

[TABLE]

Lemma A.4 (Sign and support stability).

Assume $\mu<\|{\mathbf{A}}\|^{-2}$. Then the differences of successive iterates, $\|{\mathbf{w}}^{k+1}-{\mathbf{w}}^{k}\|_{2}$, $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}$, and $\|{\mathbf{v}}^{k+1}-{\mathbf{v}}^{k}\|_{2}$, converge to zero, and all but finitely many iterates ${\mathbf{u}}^{k}$ share the same finite support and the same signs.

Proof.

First, note that $g$ is a proper and coercive function. Second, as $g({\mathbf{w}})=\inf_{{\mathbf{u}}\in\mathbb{R}^{n}}f({\mathbf{u}},{\mathbf{w}})$ , for $f$ continuous, we obtain continuity of $g$ at any point ${\mathbf{w}}\in\mathbb{R}^{n}$ since by coercivity of $f$ the infimum can be restricted to a finite ball and the infimum of continuous functions on a compact set is continuous. Consequently, by [9, Corollary 2.1] and the assumption on $\mu$ we have $\|{\mathbf{w}}^{k+1}-{\mathbf{w}}^{k}\|_{2}\rightarrow 0$ , for ${\mathbf{w}}^{k+1}=\operatorname{prox}_{\mu,g}({\mathbf{w}}^{k}-\mu{\mathbf{A}}^{\top}({\mathbf{A}}{\mathbf{w}}^{k}-{\mathbf{y}}))$ . By the KKT-conditions of (24), we obtain

[TABLE]

Subtracting the two equations gives $\|{\mathbf{v}}^{k+1}-{\mathbf{v}}^{k}\|_{2}\rightarrow 0$ , and ${\mathbf{u}}^{k}={\mathbf{w}}^{k}-{\mathbf{v}}^{k}$ yields $\|{\mathbf{u}}^{k+1}-{\mathbf{u}}^{k}\|_{2}\rightarrow 0$ . The second claim follows as in Lemma A.3, since ${\mathbf{u}}^{k}$ is a thresholded version of ${\mathbf{w}}^{k}$ . ∎

Proof of Theorem 2.11.

First note that ${\mathbf{w}}^{k}\rightarrow{\mathbf{w}}^{\star}$ implies via Lemma A.4 that ${\mathbf{u}}^{k}\rightarrow{\mathbf{u}}^{\star}$ and ${\mathbf{v}}^{k}\rightarrow{\mathbf{v}}^{\star}$. Furthermore, ${\mathbf{w}}^{\star}$ is a fixed point of (15), by [9, Proposition 2.3]. By Lemma A.4 there exists $k_{0}$ such that for all $k\geq k_{0}$ the support of ${\mathbf{u}}^{k}$ is finite, and the support and sign of ${\mathbf{u}}^{k}$ are equal to those of ${\mathbf{u}}^{\star}$. Denote $I=\mathrm{supp}({\mathbf{u}}^{\star})$. By the KKT-conditions of (24), we get

[TABLE]

and

[TABLE]

For $\psi({\mathbf{u}})=\frac{1}{q}\left\|{{\mathbf{u}}}\right\|_{q}^{q}$ with $\psi^{\prime}({\mathbf{u}})=(\operatorname{sgn}(\mathsf{u}_{i})\left|{\mathsf{u}_{i}}\right|^{q-1})_{i\in[n]}$ acting entry-wise, this implies

[TABLE]

and

[TABLE]

Repeating the steps as in Theorem 2.7, from (25) we get

[TABLE]

and from (26) we obtain

[TABLE]

Squaring and summing the last two equations, the claim follows by orthogonality of $({\mathbf{w}}^{k+1}-{\mathbf{w}}^{\star})_{I}$ and $({\mathbf{w}}^{k+1}-{\mathbf{w}}^{\star})_{I^{c}}$ . ∎

Appendix B Coherence Bound

The following lemma bounds the coherence of ${\mathbf{B}}_{\beta}$ in terms of the coherence of ${\mathbf{A}}$. The bound becomes tight for large choices of $\beta$.

Lemma B.1.

We have

[TABLE]

Proof.

Recall that the coherence of a matrix is defined as

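$$\mathrm{coh}({\mathbf{M}})=\max_{i\neq j}\frac{\left|\langle{\mathbf{m}}_{i},{\mathbf{m}}_{j}\rangle\right|}{\|{\mathbf{m}}_{i}\|_{2}\|{\mathbf{m}}_{j}\|_{2}},$$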

where ${\mathbf{m}}_{i}$ is the $i$ -th column of ${\mathbf{M}}$ . Define ${\mathbf{Q}}_{\beta}=\mathsf{{\mathbf{I}}{{\mathbf{d}}}}_{m}+\frac{{\mathbf{A}}{\mathbf{A}}^{\top}}{\beta}$ , so that ${\mathbf{B}}_{\beta}={\mathbf{Q}}_{\beta}^{-1/2}{\mathbf{A}}$ , and let ${\mathbf{A}}={\mathbf{U}}\mathbf{\Sigma}{\mathbf{V}}^{\top}$ be the SVD of ${\mathbf{A}}$ . This gives

[TABLE]

Therefore,

[TABLE]

for $c_{\beta}=\frac{\left\|{{\mathbf{A}}}\right\|^{2}}{\beta}$ , and by triangle inequality and Cauchy-Schwarz

[TABLE]

for all columns ${\mathbf{b}}_{i},{\mathbf{b}}_{j}$ of ${\mathbf{B}}_{\beta}$ . By the same argument we compute

[TABLE]

giving

[TABLE]

This yields

[TABLE]

which implies

[TABLE]

∎

For small $\beta$ , the bound in Lemma B.1 is lossy. However, we can show that the coherence of ${\mathbf{B}}_{\beta}$ converges to the coherence of a conditioned version of ${\mathbf{A}}$ , for $\beta\rightarrow 0$ .

Lemma B.2.

Let ${\mathbf{A}}\in\mathbb{R}^{m\times n}$ , for $m\leq n$ , have full rank. We have $\mathrm{coh}({\mathbf{B}}_{\beta})\rightarrow\mathrm{coh}(({\mathbf{A}}{\mathbf{A}}^{\top})^{-\frac{1}{2}}{\mathbf{A}})$ , for $\beta\rightarrow 0$ .

Proof.

Define ${\mathbf{Q}}_{\beta}=\mathsf{{\mathbf{I}}{{\mathbf{d}}}}_{m}+\frac{{\mathbf{A}}{\mathbf{A}}^{\top}}{\beta}$, so that ${\mathbf{B}}_{\beta}={\mathbf{Q}}_{\beta}^{-1/2}{\mathbf{A}}$, and let ${\mathbf{A}}={\mathbf{U}}\mathbf{\Sigma}{\mathbf{V}}^{\top}$ be the SVD of ${\mathbf{A}}$. Define ${\mathbf{C}}=\sqrt{\beta}({\mathbf{A}}{\mathbf{A}}^{\top})^{-\frac{1}{2}}{\mathbf{A}}$ with columns ${\mathbf{c}}_{i}$. First note that

[TABLE]

and

[TABLE]

Consequently,

[TABLE]

and

[TABLE]

Since we have in addition that $\|{\mathbf{c}}_{i}\|_{2}\leq\sqrt{\beta}\|({\mathbf{A}}{\mathbf{A}}^{\top})^{-\frac{1}{2}}{\mathbf{A}}\|=\sqrt{\beta}$ , $\|{\mathbf{b}}_{i}\|_{2}\leq\|{\mathbf{Q}}_{\beta}^{-\frac{1}{2}}{\mathbf{A}}\|\leq\sqrt{\beta}$ , and $\|{\mathbf{b}}_{i}\|_{2}\geq(\|{\mathbf{A}}\|^{2}+\beta)^{-\frac{1}{2}}\|{\mathbf{a}}_{i}\|_{2}\sqrt{\beta}$ , we get

[TABLE]

We conclude by noting that $\mathrm{coh}({\mathbf{C}})=\mathrm{coh}(({\mathbf{A}}{\mathbf{A}}^{\top})^{-\frac{1}{2}}{\mathbf{A}})$ . ∎
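The two limiting regimes discussed in this appendix can be explored numerically. The sketch below uses the standard definition of coherence as the largest absolute normalized inner product between distinct columns, forms ${\mathbf{B}}_{\beta}$ from the SVD of ${\mathbf{A}}$, and compares its coherence with that of $({\mathbf{A}}{\mathbf{A}}^{\top})^{-\frac{1}{2}}{\mathbf{A}}={\mathbf{U}}{\mathbf{V}}^{\top}$ for small $\beta$ and with that of ${\mathbf{A}}$ for large $\beta$. All function names and the test dimensions are illustrative assumptions.

```python
import numpy as np


def coherence(M):
    """Largest |<m_i, m_j>| / (||m_i|| ||m_j||) over distinct columns i != j."""
    G = M.T @ M
    norms = np.sqrt(np.diag(G))
    C = np.abs(G) / np.outer(norms, norms)
    np.fill_diagonal(C, 0.0)  # exclude the trivial pairs i == j
    return C.max()


def augmented_matrix_svd(A, beta):
    """B_beta = (Id + A A^T / beta)^(-1/2) A, written via the thin SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * (s / np.sqrt(1.0 + s ** 2 / beta))) @ Vt


def whitened(A):
    """(A A^T)^(-1/2) A = U V^T for a full-rank A with m <= n."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt


rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))

for beta in (1e-8, 1e-2, 1.0, 1e2, 1e8):
    print(beta, coherence(augmented_matrix_svd(A, beta)))
print("coh(A)            =", coherence(A))
print("coh(whitened(A))  =", coherence(whitened(A)))
```

Since coherence is invariant under scaling of the whole matrix, the factor $\sqrt{\beta}$ in ${\mathbf{C}}$ plays no role in the comparison.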

Bibliography


[1] S. Aeron, V. Saligrama, and M. Zhao. Information theoretic bounds for compressed sensing. IEEE Transactions on Information Theory, 56(10):5111–5130, 2010.
[2] E. Arias-Castro and Y. C. Eldar. Noise folding in compressed sensing. IEEE Signal Processing Letters, 18(8):478–481, 2011.
[3] M. Artina, M. Fornasier, and S. Peter. Damping noise-folding and enhanced support recovery in compressed sensing. IEEE Transactions on Signal Processing, 63(22):5990–6002, 2015.
[4] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
[5] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
[6] H. H. Bauschke, P. L. Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.
[7] A. Beck. First-order methods in optimization. SIAM, 2017.
[8] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471–507, 2017.
