Accelerated Schemes for the $L_1/L_2$ Minimization

Chao Wang; Ming Yan; Yaghoub Rahimi; Yifei Lou

arXiv:1905.08946·math.NA·May 7, 2020·IEEE Trans. Signal Process.


TL;DR

This paper introduces accelerated algorithms for $L_1/L_2$ minimization in sparse recovery, demonstrating efficiency and effectiveness, especially with high dynamic range signals, and providing empirical insights into exact $L_1$ recovery.

Contribution

It proposes three new numerical algorithms for $L_1/L_2$ minimization, two of which are adaptive schemes that reduce computation time, and analyzes the convergence of the adaptive schemes.

Findings

01. Algorithms are comparable to state-of-the-art methods.

02. Adaptive schemes work well with high dynamic range signals.

03. Empirical evidence suggests conditions for exact $L_1$ recovery.

Abstract

In this paper, we consider the $L_{1}/L_{2}$ minimization for sparse recovery and study its relationship with the $L_{1}$-$\alpha L_{2}$ model. Based on this relationship, we propose three numerical algorithms to minimize this ratio model, two of which work as adaptive schemes and greatly reduce the computation time. Focusing on the two adaptive schemes, we discuss their connection to existing approaches and analyze their convergence. The experimental results demonstrate that the proposed approaches are comparable to state-of-the-art methods in sparse recovery and work particularly well when the ground-truth signal has a high dynamic range. Lastly, we reveal some empirical evidence on the exact $L_{1}$ recovery under various combinations of sparsity, coherence, and dynamic ranges, which calls for theoretical justification in the future.

Tables

Table I: Success rate (%) in solving different dynamic ranges via the $L_{1}$ model at two coherence levels, $F=1$ and $F=20$.

$F = 1$
$s$	2	6	10	14	18	22
$D = 0$	100	100	80	4	0	0
$D = 1$	100	100	80	4	0	0
$D = 2$	100	100	80	4	0	0
$D = 3$	100	100	80	4	0	0
$D = 4$	100	100	86	16	0	0
$D = 5$	100	100	88	38	12	0
$F = 20$
$s$	2	6	10	14	18	22
$D = 0$	100	100	100	100	50	0
$D = 1$	100	100	100	100	52	0
$D = 2$	100	100	100	100	52	0
$D = 3$	100	100	100	100	52	0
$D = 4$	100	100	100	100	54	0
$D = 5$	100	100	100	100	76	16

Equations

$$\min_{\mathbf{x}\in\mathds{R}^{n}}\|\mathbf{x}\|_{0}\quad\mathrm{s.t.}\quad A\mathbf{x}=\mathbf{b}.$$

$$\min_{\mathbf{x}\in\mathds{R}^{n}}\frac{\|\mathbf{x}\|_{1}}{\|\mathbf{x}\|_{2}}\quad\mathrm{s.t.}\quad A\mathbf{x}=\mathbf{b}.$$

$$L(\mathbf{x},\mathbf{y},\mathbf{z};\mathbf{v},\mathbf{w})=\ldots+\frac{\rho_{2}}{2}\left\|\mathbf{x}-\mathbf{z}+\frac{1}{\rho_{2}}\mathbf{w}\right\|_{2}^{2},$$

$$I(\mathbf{t})=\begin{cases}0, & \mathbf{t}=\mathbf{0},\\ +\infty, & \text{otherwise}.\end{cases}$$

$$\alpha^{\ast}:=\inf_{\mathbf{x}\in\mathds{R}^{n}}\left\{\frac{\|\mathbf{x}\|_{1}}{\|\mathbf{x}\|_{2}}\ \ \mathrm{s.t.}\ \ A\mathbf{x}=\mathbf{b}\right\},$$

$$T(\alpha):=\inf_{\mathbf{x}\in\mathds{R}^{n}}\left\{\|\mathbf{x}\|_{1}-\alpha\|\mathbf{x}\|_{2}\ \ \mathrm{s.t.}\ \ A\mathbf{x}=\mathbf{b}\right\},$$

$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}\in\mathds{R}^{n}}g(\mathbf{x})-\left\langle\mathbf{x},\nabla h(\mathbf{x}^{(k)})\right\rangle.$$

$$g(\mathbf{x})=\|\mathbf{x}\|_{1}+I(A\mathbf{x}-\mathbf{b})\quad\text{and}\quad h(\mathbf{x})=\alpha\|\mathbf{x}\|_{2},$$

$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}\in\mathds{R}^{n}}g(\mathbf{x})-\left\langle\mathbf{x},\frac{\alpha\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle.$$

$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}\left\{g(\mathbf{x})-\left\langle\mathbf{x},\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle\right\},\\ \alpha^{(k+1)}=\|\mathbf{x}^{(k+1)}\|_{1}/\|\mathbf{x}^{(k+1)}\|_{2},\end{cases}$$

$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}\left\{g(\mathbf{x})-\left\langle\mathbf{x},\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle+\frac{\beta}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|_{2}^{2}\right\},\\ \alpha^{(k+1)}=\|\mathbf{x}^{(k+1)}\|_{1}/\|\mathbf{x}^{(k+1)}\|_{2}.\end{cases}$$

$$\min_{\bar{\mathbf{x}}\geq\mathbf{0}}\mathbf{c}^{T}\bar{\mathbf{x}}\quad\mathrm{s.t.}\quad\bar{A}\bar{\mathbf{x}}=\mathbf{b},$$

$$L_{\rho}(\mathbf{x},\mathbf{y};\mathbf{u})=\|\mathbf{y}\|_{1}+I(A\mathbf{x}-\mathbf{b})-\left\langle\mathbf{x},\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle+\frac{\beta}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|_{2}^{2}+\mathbf{u}^{T}(\mathbf{x}-\mathbf{y})+\frac{\rho}{2}\|\mathbf{x}-\mathbf{y}\|_{2}^{2}.$$

$$\begin{cases}\mathbf{x}_{j+1}=\arg\min\limits_{\mathbf{x}}L_{\rho}(\mathbf{x},\mathbf{y}_{j};\mathbf{u}_{j}),\\ \mathbf{y}_{j+1}=\arg\min\limits_{\mathbf{y}}L_{\rho}(\mathbf{x}_{j+1},\mathbf{y};\mathbf{u}_{j}),\\ \mathbf{u}_{j+1}=\mathbf{u}_{j}+\rho(\mathbf{x}_{j+1}-\mathbf{y}_{j+1}),\end{cases}$$

$$\left\|\mathbf{x}-\frac{\beta\mathbf{x}^{(k)}-\mathbf{u}_{j}+\rho\mathbf{y}_{j}+\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}}{\beta+\rho}\right\|_{2}^{2},$$

$$\mathbf{proj}(\mathbf{z})=\mathbf{z}-A^{T}(AA^{T})^{-1}(A\mathbf{z}-\mathbf{b}),$$

$$\mathbf{x}_{j+1}=\mathbf{proj}\left(\frac{\beta\mathbf{x}^{(k)}-\mathbf{u}_{j}+\rho\mathbf{y}_{j}+\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}}{\beta+\rho}\right).$$

$$\mathbf{y}_{j+1}=\arg\min_{\mathbf{y}}\left\{\|\mathbf{y}\|_{1}+\frac{\rho}{2}\left\|\mathbf{y}-\mathbf{x}_{j+1}-\frac{\mathbf{u}_{j}}{\rho}\right\|_{2}^{2}\right\}.$$

$$\mathbf{y}_{j+1}=\mathbf{shrink}\left(\mathbf{x}_{j+1}+\frac{\mathbf{u}_{j}}{\rho},\frac{1}{\rho}\right),$$

$$\begin{cases}\text{Update }\{\mathbf{x}^{(k+1)},\alpha^{(k+1)}\}\text{ by (10)} & \text{for A1},\\ \text{Update }\{\mathbf{x}^{(k+1)},\alpha^{(k+1)}\}\text{ by (11)} & \text{for A2}.\end{cases}$$

$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}f(\mathbf{x},\alpha^{(k)}),\\ \alpha^{(k+1)}=l(\mathbf{x}^{(k+1)},\alpha^{(k)}),\end{cases}$$

$$\mathbf{x}_{j+1}=\Psi(\mathbf{x}_{j},\alpha^{(k)}),$$

$$\mathbf{x}_{j+1}=\Psi(\mathbf{x}_{j},\alpha_{j+1}),$$

$$\mathbf{x}_{j+1}=\Psi(\mathbf{x}_{j},\alpha_{j})$$

$$B\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}.$$

$$q(\mathbf{x})=\frac{\langle\mathbf{x},B\mathbf{x}\rangle}{\|\mathbf{x}\|_{2}^{2}}.$$

$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}}\left\{\frac{1}{2}\langle\mathbf{x},B\mathbf{x}\rangle-\langle\mathbf{x}^{(k)},\mathbf{x}\rangle\right\}.$$

$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}}\left\{r(\mathbf{x})-\langle\nabla s(\mathbf{x}^{(k)}),\mathbf{x}\rangle\right\}.$$

$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}\left\{r(\mathbf{x})-\lambda^{(k)}\langle\nabla s(\mathbf{x}^{(k)}),\mathbf{x}\rangle\right\},\\ \lambda^{(k+1)}=\frac{r(\mathbf{x}^{(k+1)})}{s(\mathbf{x}^{(k+1)})}.\end{cases}$$

$$\left\{\begin{array}{l}0\in\frac{\partial\|\mathbf{x}^{\ast}\|_{1}}{\|\mathbf{x}^{\ast}\|_{2}}-\frac{\|\mathbf{x}^{\ast}\|_{1}}{\|\mathbf{x}^{\ast}\|_{2}^{2}}\frac{\mathbf{x}^{\ast}}{\|\mathbf{x}^{\ast}\|_{2}}+A^{T}\mathbf{s},\\ 0=A\mathbf{x}^{\ast}-\mathbf{b}.\end{array}\right.$$


Full text

Accelerated Schemes for the $L_{1}/L_{2}$ Minimization

Chao Wang, Ming Yan, Yaghoub Rahimi, Yifei Lou

C. Wang and Y. Lou are with the Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX 75080 USA (E-mail: [email protected], [email protected]). Y. Lou was partially supported by NSF Awards DMS 1522786 and 1846690. Y. Rahimi is with the School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332 USA (E-mail: [email protected]). M. Yan is with the Department of Computational Mathematics, Science and Engineering (CMSE) and the Department of Mathematics, Michigan State University, East Lansing, MI 48824 USA (E-mail: [email protected]). M. Yan was partially supported by NSF award DMS 1621798.

Abstract

In this paper, we consider the $L_{1}/L_{2}$ minimization for sparse recovery and study its relationship with the $L_{1}$ - $\alpha L_{2}$ model. Based on this relationship, we propose three numerical algorithms to minimize this ratio model, two of which work as adaptive schemes and greatly reduce the computation time. Focusing on the two adaptive schemes, we discuss their connection to existing approaches and analyze their convergence. The experimental results demonstrate that the proposed algorithms are comparable to state-of-the-art methods in sparse recovery and work particularly well when the ground-truth signal has a high dynamic range. Lastly, we reveal some empirical evidence on the exact $L_{1}$ recovery under various combinations of sparsity, coherence, and dynamic ranges, which calls for theoretical justification in the future.

Index Terms:

Sparsity, $L_{0}$ , adaptive scheme, dynamic range.

I Introduction

In various science and engineering applications, one seeks a low-dimensional representation of high-dimensional data, and sparsity is a crucial assumption. For example, it is reasonable to assume in machine learning [1] that only a few features correspond to the response. In image processing [2], the restored images are often piecewise constant, which means that gradients are sparse. In non-negative matrix factorization [3], the low-rank decomposition enforces sparsity with respect to singular values.

Sparse signal recovery aims to find the sparsest solution of $A\mathbf{x}=\mathbf{b}$, where $A\in\mathds{R}^{m\times n}$ ($m\ll n$), $\mathbf{x}\in\mathds{R}^{n}$, and $\mathbf{b}\in\mathds{R}^{m}$. We assume that $A$ has full row rank and $\mathbf{b}$ is nonzero. This problem is often referred to as compressed sensing (CS) [4, 5] in the sense that the sparse signal $\mathbf{x}$ is compressible. Mathematically, it can be formulated as the $L_{0}$ minimization,

$$\min_{\mathbf{x}\in\mathds{R}^{n}}\|\mathbf{x}\|_{0}\quad\mathrm{s.t.}\quad A\mathbf{x}=\mathbf{b}.\tag{1}$$

Unfortunately, the $L_{0}$ problem is known to be NP-hard [6]. Various approaches in sparse recovery have been investigated. Some greedy methods include orthogonal matching pursuit (OMP) [7], orthogonal least squares (OLS) [8], and compressive sampling matching pursuit (CoSaMP) [9]. However, these greedy methods often lack accuracy when $n$ is large. Alternatively, approximation/relaxation approaches to the $L_{0}$ norm have been sought. For example, convex relaxation, referred to as basis pursuit (BP) [10], replaces $L_{0}$ in (1) with the $L_{1}$ norm. Recently, nonconvex models have attracted a considerable amount of attention due to their sharper approximations of $L_{0}$ compared to the $L_{1}$ norm. Some popular nonconvex models include $L_{p}$ [11, 12, 13], $L_{1}$-$L_{2}$ [14, 15], transformed $L_{1}$ (TL1) [16, 17, 18], nonnegative garrote [19], and capped-$L_{1}$ [20, 21, 22]. Except for $L_{1}$-$L_{2}$, all of these nonconvex models involve one parameter to be determined and adjusted for different types of sparse recovery problems.

In this paper, we study the ratio of $L_{1}$ and $L_{2}$ as a scale-invariant and parameter-free metric to approximate the desired scale-invariant $L_{0}$ norm. The ratio of $L_{1}$ and $L_{2}$ can be traced back to [23] as a sparsity measure, and its scale-invariant property was explicitly mentioned in [24]. Esser et al. [25, 14] focused on nonnegative signals and established the equivalence between $L_{1}/L_{2}$ and $L_{0}$ . The ratio model was later formulated as a nonlinear constraint that was solved by a lifted approach [26, 27]. Some applications of $L_{1}/L_{2}$ include blind deconvolution [28, 29] and sparse filtering [30, 31].
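As a quick numerical illustration of the two properties named above, here is a minimal sketch (the helper name `l1_over_l2` and the test vectors are ours, not from the paper). It shows that the ratio is unchanged when a vector is rescaled, and that it is smaller for a sparser vector, staying between 1 and $\sqrt{n}$ for any nonzero vector.

```python
import math

def l1_over_l2(x):
    """Ratio ||x||_1 / ||x||_2 of a nonzero vector (hypothetical helper)."""
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return l1 / l2

sparse = [3.0, 0.0, 0.0, 0.0]   # one nonzero entry
dense = [1.0, 1.0, 1.0, 1.0]    # four equal entries

ratio_sparse = l1_over_l2(sparse)                      # attains the lower bound 1
ratio_dense = l1_over_l2(dense)                        # attains the upper bound sqrt(4)
ratio_scaled = l1_over_l2([10.0 * v for v in dense])   # rescaling leaves the ratio unchanged
```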

In our earlier work [32], we focused on a constrained minimization problem,

$$\min_{\mathbf{x}\in\mathds{R}^{n}}\frac{\|\mathbf{x}\|_{1}}{\|\mathbf{x}\|_{2}}\quad\mathrm{s.t.}\quad A\mathbf{x}=\mathbf{b}.\tag{2}$$

Theoretically, we proved that any $s$-sparse vector is a local minimizer of the $L_{1}/L_{2}$ model provided that a strong null space property (sNSP) condition holds. Computationally, we minimized (2) via the alternating direction method of multipliers (ADMM) [33]. In particular, we introduced two auxiliary variables and formed the augmented Lagrangian as

[TABLE]

where $I(\cdot)$ is defined as

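$$I(\mathbf{t})=\begin{cases}0, & \mathbf{t}=\mathbf{0},\\ +\infty, & \text{otherwise}.\end{cases}\tag{4}$$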

There is a closed-form solution for each sub-problem. Please refer to [32] for more details.

This paper contributes three schemes to minimize (2). We demonstrate in experiments that the new schemes are computationally more efficient than the previous ADMM approach. The novelties of the paper are three-fold:

(1) Thanks to the new schemes, $L_{1}/L_{2}$ can effectively deal with sparse signals with a high dynamic range, which is not the case for the ADMM approach;

(2) We reveal the connection of the proposed schemes to existing approaches, which helps to establish the convergence;

(3) Our empirical results shed light on the effects of sparsity, coherence, and dynamic range on sparse recovery, which is new in the CS literature.

The rest of the paper is organized as follows. Section II is devoted to theoretical analysis on the relation between $L_{1}/L_{2}$ and $L_{1}$-$\alpha L_{2}$, which motivates three numerical schemes to minimize $L_{1}/L_{2}$. We interpret the proposed schemes in line with some existing approaches in Section III, followed by convergence analysis in Section IV. We conduct extensive experiments in Section V to demonstrate the performance of the $L_{1}/L_{2}$ model with the three minimizing algorithms in comparison with state-of-the-art methods in sparse recovery. Section VI presents how the classic $L_{1}$ approach behaves under different dynamic ranges and how sparsity, coherence, and dynamic range interplay in sparse recovery. Finally, conclusions and future work are given in Section VII.

II Numerical schemes

We establish in Proposition 1 a link between the constrained $L_{1}/L_{2}$ formulation (2) and $L_{1}$-$\alpha L_{2}$, where $\alpha$ is a positive parameter. Immediately following this proposition, we develop a numerical algorithm for minimizing the ratio model. We further discuss two accelerated approaches in Section II-B.

Proposition 1.

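Denote

$$\alpha^{\ast}:=\inf_{\mathbf{x}\in\mathds{R}^{n}}\left\{\frac{\|\mathbf{x}\|_{1}}{\|\mathbf{x}\|_{2}}\ \ \mathrm{s.t.}\ \ A\mathbf{x}=\mathbf{b}\right\},\tag{5}$$

and

$$T(\alpha):=\inf_{\mathbf{x}\in\mathds{R}^{n}}\left\{\|\mathbf{x}\|_{1}-\alpha\|\mathbf{x}\|_{2}\ \ \mathrm{s.t.}\ \ A\mathbf{x}=\mathbf{b}\right\},\tag{6}$$

then we have

(a) if $T(\alpha)<0$, then $\alpha>\alpha^{\ast}$;

(b) if $T(\alpha)\geq 0$, then $\alpha\leq\alpha^{\ast}$;

(c) if $T(\alpha)=0$, then $\alpha=\alpha^{\ast}$.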

Proof.

Denote the feasible set of (5) by $\mathbf{F}=\{\mathbf{x}\mid A\mathbf{x}=\mathbf{b}\}$. Since $\mathbf{b}\neq\mathbf{0}$, we have $\mathbf{0}\notin\mathbf{F}$.

(a) If $T(\alpha)<0$, then there exists $\mathbf{x}\in\mathbf{F}$ such that $\|\mathbf{x}\|_{1}-\alpha\|\mathbf{x}\|_{2}<0$, which implies that $\alpha>\frac{\|\mathbf{x}\|_{1}}{\|\mathbf{x}\|_{2}}\geq\alpha^{\ast}$. Therefore, we have $\alpha>\alpha^{\ast}$.

(b) If $T(\alpha)\geq 0$, then for all $\mathbf{x}\in\mathbf{F}$ we have $\|\mathbf{x}\|_{1}-\alpha\|\mathbf{x}\|_{2}\geq 0$. So $\alpha\leq\frac{\|\mathbf{x}\|_{1}}{\|\mathbf{x}\|_{2}}$ for all $\mathbf{x}\in\mathbf{F}$ and hence $\alpha\leq\inf\limits_{\mathbf{x}\in\mathbf{F}}\frac{\|\mathbf{x}\|_{1}}{\|\mathbf{x}\|_{2}}=\alpha^{\ast}$, i.e., $\alpha\leq\alpha^{\ast}$.

(c) If $T(\alpha)=0$, then by part (b) we get $\alpha\leq\alpha^{\ast}$. Furthermore, there exists a sequence $\{\mathbf{x}_{n}\}\subset\mathbf{F}$ such that $\lim\limits_{n\to\infty}\left(\|\mathbf{x}_{n}\|_{1}-\alpha\|\mathbf{x}_{n}\|_{2}\right)=0$. Since $\mathbf{x}_{n}\in\mathbf{F}$, we have $\|\mathbf{b}\|_{2}=\|A\mathbf{x}_{n}\|_{2}\leq\|A\|_{2}\|\mathbf{x}_{n}\|_{2}$. Hence, $\{\mathbf{x}_{n}\}$ is bounded below in norm, i.e., $\|\mathbf{x}_{n}\|_{2}\geq\|\mathbf{b}\|_{2}/\|A\|_{2}$ for all $n$. Dividing by $\|\mathbf{x}_{n}\|_{2}$, we get $\lim\limits_{n\to\infty}\left(\|\mathbf{x}_{n}\|_{1}-\alpha\|\mathbf{x}_{n}\|_{2}\right)/\|\mathbf{x}_{n}\|_{2}=0$, which means $\alpha^{\ast}\leq\lim\limits_{n\rightarrow\infty}\|\mathbf{x}_{n}\|_{1}/\|\mathbf{x}_{n}\|_{2}=\alpha$. Therefore, we have $\alpha=\alpha^{\ast}$.

∎
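As a small worked illustration of Proposition 1 (our own toy instance, not taken from the paper), take $n=2$, $A=[1\ \ 1]$ and $b=1$, so that the feasible set is the line $\{(t,1-t): t\in\mathds{R}\}$. The ratio equals 1 exactly at the two 1-sparse feasible points, which gives $\alpha^{\ast}=1$ and $T(1)=0$:

```latex
\begin{aligned}
\mathbf{x}(t) &= (t,\,1-t), \qquad A\mathbf{x}(t)=t+(1-t)=1=b,\\
\frac{\|\mathbf{x}(t)\|_{1}}{\|\mathbf{x}(t)\|_{2}} &\geq 1 \ \text{for all } t,\ \text{with equality iff } t\in\{0,1\}
\ \Longrightarrow\ \alpha^{\ast}=1,\\
T(1) &= \inf_{t}\left\{\|\mathbf{x}(t)\|_{1}-\|\mathbf{x}(t)\|_{2}\right\}=0
\quad (\text{attained at } t\in\{0,1\}),\\
T(1.2) &\leq \|\mathbf{x}(0)\|_{1}-1.2\,\|\mathbf{x}(0)\|_{2} = 1-1.2 = -0.2 < 0 .
\end{aligned}
```

The third line matches part (c), since $T(1)=0$ and $\alpha^{\ast}=1$, and the last line matches part (a), since $T(1.2)<0$ and $1.2>\alpha^{\ast}$.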

II-A Bisection Search

It follows from Proposition 1 that the optimal value of $L_{1}/L_{2}$ equals the value of $\alpha$ in the $L_{1}$-$\alpha L_{2}$ model if the objective value of $L_{1}$-$\alpha L_{2}$ is zero. That is to say, the optimal value of the ratio model is the root of $T(\alpha)$, which can be obtained by bisection search. Moreover, we have upper/lower bounds of $\alpha$, i.e., $\alpha\in[1,\sqrt{n}]$, since $\|\mathbf{x}\|_{2}\leq\|\mathbf{x}\|_{1}\leq\sqrt{n}\|\mathbf{x}\|_{2},\ \forall\mathbf{x}\in\mathds{R}^{n}$ [34]. The procedure goes as follows: we start with an initial range of $\alpha$, namely $[1,\ \sqrt{n}]$, and an initial value $\alpha^{(0)}$ in between. Using this $\alpha^{(0)}$, we solve the $L_{1}$-$\alpha^{(0)}L_{2}$ minimization via the difference-of-convex algorithm (DCA) [35]; more details on the DCA implementation are given in Section II-B. Based on the objective value $T(\alpha^{(0)})$, we update the range of $\alpha$. Specifically, if $T(\alpha^{(0)})=0$, then we have found the minimum ratio, and the corresponding minimizer $\mathbf{x}^{\ast}$ of the $L_{1}$-$\alpha^{(0)}L_{2}$ model is also a minimizer of the $L_{1}/L_{2}$ model. If $T(\alpha^{(0)})>0$, then we update the range as $[\alpha^{(0)},\ \sqrt{n}].$ If $T(\alpha^{(0)})<0$, then the minimum ratio is smaller than $\alpha^{(0)}$, so we can shorten the range from $[1,\ \sqrt{n}]$ to $[1,\ \alpha^{(0)}].$ We can further shorten the interval to $\left[1,\ \frac{\|\mathbf{x}^{(k+1)}\|_{1}}{\|\mathbf{x}^{(k+1)}\|_{2}}\right],$ as the objective value of $L_{1}$-$\frac{\|\mathbf{x}^{(k+1)}\|_{1}}{\|\mathbf{x}^{(k+1)}\|_{2}}L_{2}$ would be less than or equal to zero in the next iteration. After the range is updated, we choose $\alpha^{(1)}$ as the midpoint of the two end points and iterate.

We summarize the entire process as Algorithm 1, in which the stopping criterion is that the difference between two adjacent $\alpha$ values is small enough. As the algorithmic scheme follows directly from bisection search, we refer to the algorithm as $L_{1}/L_{2}$-BS, or BS if the context is clear. The convergence of BS can be obtained in the same way that the bisection method converges. However, due to the nonconvex nature of the $L_{1}$-$\alpha L_{2}$ minimization (6), there is no guarantee of finding its global minimizer, and hence the solution to (5) may be suboptimal.
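The bisection procedure described above can be sketched as follows. This is a minimal sketch rather than the authors' Algorithm 1: the inner $L_{1}$-$\alpha L_{2}$ solver is abstracted as a user-supplied oracle `solve_l1_minus_alpha_l2` (a hypothetical name), and the toy oracle simply minimizes over three hand-picked candidate points standing in for the feasible set.

```python
import math

def l1_norm(x):
    return sum(abs(v) for v in x)

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

def bisection_search(solve_l1_minus_alpha_l2, n, tol=1e-10, max_iter=200):
    """Bisection on alpha over [1, sqrt(n)].

    `solve_l1_minus_alpha_l2(alpha)` is a hypothetical oracle returning a
    feasible minimizer of ||x||_1 - alpha * ||x||_2.
    """
    lo, hi = 1.0, math.sqrt(n)
    alpha = 0.5 * (lo + hi)
    x = solve_l1_minus_alpha_l2(alpha)
    for _ in range(max_iter):
        value = l1_norm(x) - alpha * l2_norm(x)   # estimate of T(alpha)
        if value > 0:
            lo = alpha                            # alpha lies below the minimum ratio
        else:
            hi = l1_norm(x) / l2_norm(x)          # tightened upper end, at most alpha
        new_alpha = 0.5 * (lo + hi)               # midpoint of the updated range
        converged = abs(new_alpha - alpha) < tol  # adjacent alpha values are close
        alpha = new_alpha
        x = solve_l1_minus_alpha_l2(alpha)
        if converged:
            break
    return x, alpha

# Toy oracle (assumption): the feasible set is replaced by three candidate points.
candidates = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0), (2.0, 0.0, 1.0)]

def toy_oracle(alpha):
    return min(candidates, key=lambda c: l1_norm(c) - alpha * l2_norm(c))

x_best, alpha_best = bisection_search(toy_oracle, n=3)
```

With an exact oracle, the interval always contains the minimum ratio and at least halves at every step; with a DCA-based inner solver that may return a local minimizer, the same loop runs but the limit may be suboptimal, as noted in the text.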

II-B Adaptive Algorithms

The BS algorithm is computationally expensive, since the $L_{1}$-$\alpha L_{2}$ minimization is solved multiple times. To speed it up, we discuss two variants of $L_{1}/L_{2}$-BS that update the parameter $\alpha$ iteratively while minimizing $\|\mathbf{x}\|_{1}-\alpha\|\mathbf{x}\|_{2}$.

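Following the DCA framework [36, 37] to minimize $\|\mathbf{x}\|_{1}-\alpha\|\mathbf{x}\|_{2}$, we consider the objective function as the difference of two convex functions, i.e., $\min\limits_{\mathbf{x}\in\mathds{R}^{n}}g(\mathbf{x})-h(\mathbf{x}).$ By linearizing the second term $h(\cdot)$, the DCA iterates as follows,

$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}\in\mathds{R}^{n}}g(\mathbf{x})-\left\langle\mathbf{x},\nabla h(\mathbf{x}^{(k)})\right\rangle.\tag{7}$$

Particularly for the $L_{1}$-$\alpha L_{2}$ model, we have

$$g(\mathbf{x})=\|\mathbf{x}\|_{1}+I(A\mathbf{x}-\mathbf{b})\quad\text{and}\quad h(\mathbf{x})=\alpha\|\mathbf{x}\|_{2},\tag{8}$$

thus leading to the DCA update as

$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}\in\mathds{R}^{n}}g(\mathbf{x})-\left\langle\mathbf{x},\frac{\alpha\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle.\tag{9}$$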

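We now update $\alpha$ iteratively by the ratio of the current solution, leading to the following scheme,

$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}\left\{g(\mathbf{x})-\left\langle\mathbf{x},\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle\right\},\\ \alpha^{(k+1)}=\|\mathbf{x}^{(k+1)}\|_{1}/\|\mathbf{x}^{(k+1)}\|_{2},\end{cases}\tag{10}$$

where $g$ is defined in (8). Notice that the $\mathbf{x}$-subproblem in (10) is a linear programming (LP) problem, which unfortunately is not guaranteed to have an optimal solution (as the problem can be unbounded). To increase the robustness of the algorithm, we further incorporate a quadratic term into the linear problem, i.e.,

$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}\left\{g(\mathbf{x})-\left\langle\mathbf{x},\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle+\frac{\beta}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|_{2}^{2}\right\},\\ \alpha^{(k+1)}=\|\mathbf{x}^{(k+1)}\|_{1}/\|\mathbf{x}^{(k+1)}\|_{2}.\end{cases}\tag{11}$$

We denote these two adaptive methods (10) and (11) as $L_{1}/L_{2}$-A1 and $L_{1}/L_{2}$-A2, respectively, or A1 and A2 for short. Both algorithms are summarized in Algorithm 2.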

For the $\mathbf{x}$ subproblem of $L_{1}/L_{2}$ -A1, we convert it into an LP problem. Assume that $\mathbf{x}=\mathbf{x}^{+}-\mathbf{x}^{-}$ where $\mathbf{x}^{+}\geq\mathbf{0}$ and $\mathbf{x}^{-}\geq\mathbf{0}.$ Denote $\bar{\mathbf{x}}=\begin{bmatrix}\mathbf{x}^{+}\\ \mathbf{x}^{-}\end{bmatrix},$ then $A\mathbf{x}=\mathbf{b}$ becomes $\bar{A}\bar{\mathbf{x}}=\mathbf{b}$ with $\bar{A}=\begin{bmatrix}A&-A\end{bmatrix}$ . Therefore, the $\mathbf{x}$ -subproblem becomes

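$$\min_{\bar{\mathbf{x}}\geq\mathbf{0}}\mathbf{c}^{T}\bar{\mathbf{x}}\quad\mathrm{s.t.}\quad\bar{A}\bar{\mathbf{x}}=\mathbf{b},\tag{12}$$

where $\mathbf{c}=\left[\mathbf{1}-\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}};\mathbf{1}+\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right]$, since $\|\mathbf{x}\|_{1}=\mathbf{1}^{T}(\mathbf{x}^{+}+\mathbf{x}^{-})$ and the linear term $-\left\langle\mathbf{x},\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle$ enters with a negative sign on $\mathbf{x}^{+}$ and a positive sign on $\mathbf{x}^{-}$. We adopt the software Gurobi [38] to solve this LP problem.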
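The variable splitting above can be checked numerically. The sketch below is our own illustration with a hypothetical helper `a1_lp_data` and random test data; no LP solver is called. It builds the cost vector and constraint matrix for the ordering $\bar{\mathbf{x}}=[\mathbf{x}^{+};\mathbf{x}^{-}]$, where the cost is $\mathbf{1}-\mathbf{d}$ on $\mathbf{x}^{+}$ and $\mathbf{1}+\mathbf{d}$ on $\mathbf{x}^{-}$ with $\mathbf{d}=\alpha^{(k)}\mathbf{x}^{(k)}/\|\mathbf{x}^{(k)}\|_{2}$, and verifies that the LP cost of a split vector equals $\|\mathbf{x}\|_{1}-\langle\mathbf{x},\mathbf{d}\rangle$.

```python
import numpy as np

def a1_lp_data(A, x_k, alpha_k):
    """Cost vector and constraint matrix of the LP in the split variables [x+; x-]."""
    d = alpha_k * x_k / np.linalg.norm(x_k)   # gradient of alpha * ||x||_2 at x_k
    c = np.concatenate([1.0 - d, 1.0 + d])    # cost on x+ followed by cost on x-
    A_bar = np.hstack([A, -A])                # A x = A x+ - A x-
    return c, A_bar

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
x_k = rng.standard_normal(6)
alpha_k = np.linalg.norm(x_k, 1) / np.linalg.norm(x_k, 2)
c, A_bar = a1_lp_data(A, x_k, alpha_k)

# Check on an arbitrary x: split it into positive and negative parts.
x = rng.standard_normal(6)
x_bar = np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])
d = alpha_k * x_k / np.linalg.norm(x_k)
lp_cost = c @ x_bar                               # LP objective in the split variables
direct_cost = np.linalg.norm(x, 1) - x @ d        # ||x||_1 - <x, d>
```

The pair `(c, A_bar)` could then be passed to any LP solver together with the bound $\bar{\mathbf{x}}\geq\mathbf{0}$; because some entries of `c` can be negative, the solver may report an unbounded problem, which is the situation the quadratic term in A2 is meant to avoid.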

The $\mathbf{x}$ subproblem of $L_{1}/L_{2}$ -A2 is a quadratic programming problem, which can be solved via ADMM. By introducing an auxiliary variable $\mathbf{y}$ , we have the augmented Lagrangian,

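$$L_{\rho}(\mathbf{x},\mathbf{y};\mathbf{u})=\|\mathbf{y}\|_{1}+I(A\mathbf{x}-\mathbf{b})-\left\langle\mathbf{x},\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}\right\rangle+\frac{\beta}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|_{2}^{2}+\mathbf{u}^{T}(\mathbf{x}-\mathbf{y})+\frac{\rho}{2}\|\mathbf{x}-\mathbf{y}\|_{2}^{2}.\tag{13}$$

Then the ADMM iteration goes as follows

$$\begin{cases}\mathbf{x}_{j+1}=\arg\min\limits_{\mathbf{x}}L_{\rho}(\mathbf{x},\mathbf{y}_{j};\mathbf{u}_{j}),\\ \mathbf{y}_{j+1}=\arg\min\limits_{\mathbf{y}}L_{\rho}(\mathbf{x}_{j+1},\mathbf{y};\mathbf{u}_{j}),\\ \mathbf{u}_{j+1}=\mathbf{u}_{j}+\rho(\mathbf{x}_{j+1}-\mathbf{y}_{j+1}),\end{cases}\tag{14}$$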

where the subscript $j$ indexes the inner loop, as opposed to the superscript $k$ for outer iterations used in (11). The $\mathbf{x}$ -subproblem of (14) is a projection problem to minimize

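$$\left\|\mathbf{x}-\frac{\beta\mathbf{x}^{(k)}-\mathbf{u}_{j}+\rho\mathbf{y}_{j}+\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}}{\beta+\rho}\right\|_{2}^{2},$$

under the constraint of $A\mathbf{x}=\mathbf{b}$. Since the closed-form solution of projecting a vector $\mathbf{z}$ onto this constraint is

$$\mathbf{proj}(\mathbf{z})=\mathbf{z}-A^{T}(AA^{T})^{-1}(A\mathbf{z}-\mathbf{b}),$$

the $\mathbf{x}$-update is given by

$$\mathbf{x}_{j+1}=\mathbf{proj}\left(\frac{\beta\mathbf{x}^{(k)}-\mathbf{u}_{j}+\rho\mathbf{y}_{j}+\frac{\alpha^{(k)}\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|_{2}}}{\beta+\rho}\right).$$

The $\mathbf{y}$-subproblem of (14) is equivalent to

$$\mathbf{y}_{j+1}=\arg\min_{\mathbf{y}}\left\{\|\mathbf{y}\|_{1}+\frac{\rho}{2}\left\|\mathbf{y}-\mathbf{x}_{j+1}-\frac{\mathbf{u}_{j}}{\rho}\right\|_{2}^{2}\right\}.$$

It has a closed-form solution via soft shrinkage, i.e.,

$$\mathbf{y}_{j+1}=\mathbf{shrink}\left(\mathbf{x}_{j+1}+\frac{\mathbf{u}_{j}}{\rho},\frac{1}{\rho}\right),$$

with $\mathbf{shrink}(\mathbf{v},\mu)=\mathrm{sign}(\mathbf{v})\max\left(|\mathbf{v}|-\mu,0\right).$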
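Putting the pieces of $L_{1}/L_{2}$-A2 together, the outer update of $\alpha$ and the inner ADMM loop (projection, soft shrinkage, dual update) can be sketched in NumPy as below. This is a minimal sketch under our own assumptions, not the authors' implementation: the function name `l1_over_l2_a2`, the default values of `beta` and `rho`, the fixed iteration counts, the stopping rule, and the minimum-norm starting point are all illustrative choices, and the inner loop is restarted from the current outer iterate.

```python
import numpy as np

def shrink(v, mu):
    """Soft shrinkage: sign(v) * max(|v| - mu, 0), applied entrywise."""
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def l1_over_l2_a2(A, b, beta=1.0, rho=10.0, outer_iters=50, inner_iters=200, tol=1e-8):
    AAt_inv = np.linalg.inv(A @ A.T)              # A is assumed to have full row rank

    def proj(z):
        # Projection of z onto the affine set {x : A x = b}.
        return z - A.T @ (AAt_inv @ (A @ z - b))

    x = proj(np.zeros(A.shape[1]))                # assumption: start from the minimum-norm solution
    alpha = np.linalg.norm(x, 1) / np.linalg.norm(x, 2)
    for _ in range(outer_iters):
        d = alpha * x / np.linalg.norm(x, 2)      # linearized alpha * ||x||_2 term
        x_in, y, u = x.copy(), x.copy(), np.zeros_like(x)
        for _ in range(inner_iters):              # ADMM on the x-subproblem
            x_in = proj((beta * x - u + rho * y + d) / (beta + rho))
            y = shrink(x_in + u / rho, 1.0 / rho)
            u = u + rho * (x_in - y)
        change = np.linalg.norm(x_in - x, 2)
        x = x_in                                  # feasible by construction (output of proj)
        alpha = np.linalg.norm(x, 1) / np.linalg.norm(x, 2)
        if change <= tol * max(1.0, np.linalg.norm(x, 2)):
            break
    return x, alpha

# Illustrative data (assumption): a random Gaussian matrix and a 3-sparse signal.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 40))
x_true = np.zeros(40)
x_true[[3, 17, 29]] = [1.0, -2.0, 0.5]
b = A @ x_true
x_hat, alpha_hat = l1_over_l2_a2(A, b)
```

Returning the projected variable guarantees $A\mathbf{x}=\mathbf{b}$ up to rounding at every outer iteration. Whether `x_hat` coincides with `x_true` depends on the problem instance and the parameter choices, so the sketch makes no claim about recovery.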

III Connections to previous works

We interpret the proposed adaptive methods (A1 and A2) in line with some existing approaches: parameter selection, generalized inverse power, and gradient-based methods. These interpretations contribute to the convergence analysis in Section IV.

III-A Parameter Selection

Recall that in $L_{1}/L_{2}$ -BS, the ratio $L_{1}/L_{2}$ is minimized when there exists a proper $\alpha^{\ast}$ such that $\|\mathbf{x}^{\ast}\|_{1}-\alpha^{\ast}\|\mathbf{x}^{\ast}\|_{2}=0$ with $\mathbf{x}^{\ast}=\arg\min\limits_{\mathbf{x}}\left\{\|\mathbf{x}\|_{1}-\alpha^{\ast}\|\mathbf{x}\|_{2}\ \mathrm{s.t.}\ A\mathbf{x}=\mathbf{b}\right\}$ . We can regard this process as a root-finding problem for $\alpha^{\ast}$ , which often occurs in parameter selection. For example, in the discrepancy principle method [39, 40, 41], one aims to find a parameter $\alpha$ such that the resulting data-fitting term is close to the noise level. In particular, we represent this process by

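$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}f(\mathbf{x},\alpha^{(k)}),\\ \alpha^{(k+1)}=l(\mathbf{x}^{(k+1)},\alpha^{(k)}),\end{cases}\tag{17}$$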

where $f(\cdot)$ is a general objective function to be minimized and $l(\cdot)$ is a certain scheme to update $\alpha$ so that discrepancy principle holds. Typically, an inner loop is required to find the solution of $\mathbf{x}$ -subproblem, followed by updating this parameter in an outer iteration. We further present the $j$ -th inner iteration at the $k$ -th outer iteration by

$$\mathbf{x}_{j+1}=\Psi(\mathbf{x}_{j},\alpha^{(k)}),\tag{18}$$

for the $\mathbf{x}$ -subproblem in (17).

To speed up the process, Wen and Chan [40] proposed an adaptive scheme that updates the parameter during the inner loop such that it renders the current data-fitting term equal to the noise level. In other words, instead of updating $\alpha$ after minimizing $f$, they directly iterated

$$\mathbf{x}_{j+1}=\Psi(\mathbf{x}_{j},\alpha_{j+1}),\tag{19}$$

in a way that $\{\mathbf{x}_{j+1},\alpha_{j+1}\}$ satisfies the discrepancy principle. In this way, only one loop is needed as opposed to inner/outer loops in (18). But it requires a closed-form solution for $\mathbf{x}_{j+1}$ so one can perform a one-dimensional search for $\alpha_{j+1}$ .

The proposed BS scheme falls into the framework of (17) in that the search range of the parameter is shortened at every outer iteration. However, $f$ in our BS method is the $L_{1}$-$\alpha L_{2}$ minimization, which does not have a closed-form solution. As opposed to (19), we update

$$\mathbf{x}_{j+1}=\Psi(\mathbf{x}_{j},\alpha_{j})\tag{20}$$

prior to updating $\alpha$. In other words, we update $\mathbf{x}_{j+1}$ based on $\alpha_{j}$ rather than $\alpha_{j+1}$, the latter of which was adopted in the parameter-selection method [40]. The rationale of (20) is to guarantee that $\{\mathbf{x}_{j+1},\alpha_{j+1}\}$ satisfies $\|\mathbf{x}_{j+1}\|_{1}-\alpha_{j+1}\|\mathbf{x}_{j+1}\|_{2}=0$. The iterative scheme (20) is consistent with A1 or A2 (depending on the form of $\Psi$), if we change the notation from subscript $j$ to superscript $k$.

III-B Generalized Inverse Power Methods

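A standard technique to find the smallest eigenvalue of a positive semi-definite symmetric matrix $B$ is the inverse power method [34], which iteratively solves the linear system,

$$B\mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}.\tag{21}$$

The iteration converges to an eigenvector of $B$ associated with the smallest eigenvalue, denoted by $\mathbf{x}^{\ast}$. Then the smallest eigenvalue can be evaluated by $\lambda=q(\mathbf{x}^{\ast})$, where $q(\cdot)$ is the Rayleigh quotient defined as

$$q(\mathbf{x})=\frac{\langle\mathbf{x},B\mathbf{x}\rangle}{\|\mathbf{x}\|_{2}^{2}}.$$

Note that (21) is equivalent to the minimization problem

$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}}\left\{\frac{1}{2}\langle\mathbf{x},B\mathbf{x}\rangle-\langle\mathbf{x}^{(k)},\mathbf{x}\rangle\right\}.\tag{22}$$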
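For reference, the classical inverse power iteration and the Rayleigh quotient can be sketched in a few lines of NumPy. This is our own illustration on a small diagonal test matrix: we assume $B$ is invertible so that each linear solve is well defined, and the renormalization after each solve is an addition for numerical stability that does not change the direction of the iterates.

```python
import numpy as np

def inverse_power(B, x0, iters=100):
    """Inverse power iteration: repeatedly solve B x_new = x_old."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        x = np.linalg.solve(B, x)     # one step of the linear system B x_new = x_old
        x = x / np.linalg.norm(x)     # renormalize (our addition, direction unchanged)
    rayleigh = (x @ B @ x) / (x @ x)  # Rayleigh quotient q(x)
    return x, rayleigh

# Illustrative matrix (assumption): symmetric positive definite, eigenvalues 2, 5, 9.
B = np.array([[2.0, 0.0, 0.0],
              [0.0, 5.0, 0.0],
              [0.0, 0.0, 9.0]])
x_min, lam_min = inverse_power(B, np.ones(3))
```

Here the iterates align with the first coordinate axis and the Rayleigh quotient approaches the smallest eigenvalue, 2.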

It is well known in linear algebra [34, 42] that eigenvectors of $B$ are critical points of $\min\limits_{\mathbf{x}}q(\mathbf{x})$ and the smallest eigenvalue/eigenvector can be found by (22). This idea is naturally extended to the nonlinear case in [43], where a general quotient is considered, $q(\mathbf{x})=\frac{r(\mathbf{x})}{s(\mathbf{x})},$ with arbitrary functions $r(\cdot)$ and $s(\cdot)$ . Similarly to (22), we have the corresponding scheme

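$$\mathbf{x}^{(k+1)}=\arg\min_{\mathbf{x}}\left\{r(\mathbf{x})-\langle\nabla s(\mathbf{x}^{(k)}),\mathbf{x}\rangle\right\}.$$

Following [43], we update the eigenvalue $\lambda^{(k)}$ at each iteration to guarantee the algorithm’s descent. In particular, the iterative scheme is given by

$$\begin{cases}\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{x}}\left\{r(\mathbf{x})-\lambda^{(k)}\langle\nabla s(\mathbf{x}^{(k)}),\mathbf{x}\rangle\right\},\\ \lambda^{(k+1)}=\frac{r(\mathbf{x}^{(k+1)})}{s(\mathbf{x}^{(k+1)})}.\end{cases}\tag{23}$$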

If we choose $r(\mathbf{x})=g(\mathbf{x}),\ s(\mathbf{x})=\|\mathbf{x}\|_{2},$ and denote $\lambda$ as $\alpha$ , then the generalized inverse power method (23) is $L_{1}/L_{2}$ -A1. In [44], a modified inverse power method was proposed via the steepest descent flow. The iteration scheme is to incorporate a quadratic term in the objective function of the $\mathbf{x}$ -subproblem, which leads to $L_{1}/L_{2}$ -A2.

III-C Gradient-based Methods

Definition 1.

A critical point of a constrained optimization problem is a vector in the feasible set (satisfying the constraints) that is also a local maximum, minimum, or saddle point of the objective function.

According to Karush-Kuhn-Tucker (KKT) conditions, $\mathbf{x}^{\ast}\neq\mathbf{0}$ is a critical point of (2) if and only if there exists a vector $\mathbf{s}$ such that

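$$\left\{\begin{array}{l}0\in\frac{\partial\|\mathbf{x}^{\ast}\|_{1}}{\|\mathbf{x}^{\ast}\|_{2}}-\frac{\|\mathbf{x}^{\ast}\|_{1}}{\|\mathbf{x}^{\ast}\|_{2}^{2}}\frac{\mathbf{x}^{\ast}}{\|\mathbf{x}^{\ast}\|_{2}}+A^{T}\mathbf{s},\\ 0=A\mathbf{x}^{\ast}-\mathbf{b}.\end{array}\right.\tag{24}$$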

By introducing $\hat{\mathbf{s}}=\|\mathbf{x}^{\ast}\|_{2}\mathbf{\cdot}\mathbf{s}$ , we have

[TABLE]

The condition (25) is also an optimality condition to another optimization problem:

[TABLE]

where $g(\mathbf{x})$ is from (8) and $w(\mathbf{x})$ is some function satisfying

[TABLE]

Note that $w(\cdot)$ cannot be explicitly determined from (27).

By applying a proximal gradient method (PGM) [45, 46, 47] on the model (26), we obtain the following scheme

[TABLE]

where $\mathbf{prox}_{g}(\mathbf{y})=\arg\min\limits_{\mathbf{z}}\left\{g(\mathbf{z})+\frac{1}{2}\|\mathbf{z}-\mathbf{y}\|_{2}^{2}\right\}.$ This iterative scheme is the same as $L_{1}/L_{2}$ -A2.

As for $L_{1}/L_{2}$-A1, we can interpret it as a generalized conditional gradient method [48] that minimizes $g(\mathbf{x})+w(\mathbf{x})$ by $\mathbf{x}^{(k+1)}=\arg\min\limits_{\mathbf{y}}\langle\nabla w(\mathbf{x}^{(k)}),\mathbf{y}\rangle+g(\mathbf{y}).$

IV Convergence analysis

Following the discussion in Section III-C, we present the convergence analysis. We start with the convergence of A2, which is characterized in Theorem 1. To prove it, we need four lemmas, whose proofs are given in the Appendix.

Lemma 1.

(Sufficient decrease) The sequence $\{\mathbf{x}^{(k)},\alpha^{(k)}\}$ produced by $L_{1}/L_{2}$-A2 satisfies

[TABLE]

The next two lemmas (Lemma 2 and Lemma 3) discuss the Lipschitz properties.

Lemma 2.

Define $L=\frac{1}{\|A^{T}(AA^{T})^{-1}\mathbf{b}\|_{2}}$ . Then for any $\mathbf{x},\mathbf{y}\in\mathds{R}^{n}$ satisfying $A\mathbf{x}=A\mathbf{y}=\mathbf{b}$ , we have

[TABLE]

Since the gradient of the $L_{2}$ norm is $\nabla\|\mathbf{x}\|_{2}=\frac{\mathbf{x}}{\|\mathbf{x}\|_{2}}$ , Lemma 2 implies that the gradient of Euclidean norm is Lipschitz-continuous in the domain $\{\mathbf{x}\ |\ A\mathbf{x}=\mathbf{b}\}$ . The next lemma is about the Lipschitz property for the implicit function $w(\cdot)$ that satisfies (27).

Lemma 3.

Let $L$ be defined as in Lemma 2. For any $\mathbf{x},\mathbf{y}\in\mathds{R}^{n}$ satisfying $A\mathbf{x}=A\mathbf{y}=\mathbf{b}$, we have

[TABLE]

for $w$ satisfying (27) and $L_{w}=2\sqrt{n}L$ .

Lemma 4.

Given $g(\cdot)$ defined in (8) and suppose $w(\cdot)$ satisfies (27), we denote

[TABLE]

for an arbitrary $\beta>0$ . Then we have

(a) $\Phi(\mathbf{x}^{\ast})=\mathbf{0}$ if and only if $\mathbf{x}^{\ast}$ is a critical point of (2);

(b) $\left\|\Phi(\mathbf{x})-\Phi(\mathbf{y})\right\|_{2}\leq L_{\Phi}\|\mathbf{x}-\mathbf{y}\|_{2}$ with $L_{\Phi}=L_{w}+2\beta,$ for any $\mathbf{x},\mathbf{y}\in\mathds{R}^{n}$ satisfying $A\mathbf{x}=A\mathbf{y}=\mathbf{b}$.

It is stated in (28) that $L_{1}/L_{2}$ -A2 can be expressed as $\mathbf{x}^{(k+1)}=\mathrm{prox}_{\frac{1}{\beta}g}\left(\mathbf{x}^{(k)}-\frac{1}{\beta}\nabla w(\mathbf{x}^{(k)})\right)$ . By the definition of $\Phi(\cdot)$ in (30) and the decreasing property of $\|\mathbf{x}\|_{1}/\|\mathbf{x}\|_{2}$ in Lemma 1, we can interpret A2 as a gradient descent method

[TABLE]

In the following theorem, we rely on Lemma 4 to show that the descent direction along $\Phi(\cdot)$ leads to convergence.

Theorem 1.

Let $\{\mathbf{x}^{(k)},\alpha^{(k)}\}$ be a sequence generated by $L_{1}/L_{2}$-A2. If $\{\mathbf{x}^{(k)}\}$ is bounded, then there exists a subsequence that converges to a critical point of the ratio model (2).

Proof.

According to Lemma 1, we know that $\alpha^{(k)}$ is decreasing and bounded from below, so there exists a scalar $\alpha^{\ast}$ such that $\alpha^{(k)}\rightarrow\alpha^{\ast}$ . With the boundedness assumption of $\mathbf{x}$ , we get $\|\mathbf{x}^{(k+1)}-\mathbf{x}^{(k)}\|_{2}\rightarrow 0$ from Lemma 1, which implies that $\|\Phi(\mathbf{x}^{(k)})\|_{2}\rightarrow 0$ . The boundedness of $\mathbf{x}^{(k)}$ also leads to a convergent subsequence, i.e., $\mathbf{x}^{(k_{i})}\rightarrow\mathbf{x}^{\ast}.$ Therefore, we have

[TABLE]

As $k_{i}\rightarrow\infty$ , we get $\|\Phi(\mathbf{x}^{\ast})\|_{2}=0$ and hence $\Phi(\mathbf{x}^{\ast})=\mathbf{0}.$ By Lemma 4, $\{\mathbf{x}^{(k_{i})}\}$ converges to a critical point. ∎

Remark 1.

Theorem 1 does not require the step size $\frac{1}{\beta}$ to be small, which is typically required for gradient-based methods. In our numerical tests, we can choose a small $\beta$ and still get good results.

Theorem 2.

Let $\{\mathbf{x}^{(k)},\alpha^{(k)}\}$ be a sequence generated by $L_{1}/L_{2}$-A1. If $\{\mathbf{x}^{(k)}\}$ is bounded, then it has a convergent subsequence.

Proof.

Denote

[TABLE]

Since $z(\mathbf{x}^{(k)},\mathbf{x}^{(k)})=0$ by the definition of $\alpha^{(k)}$ , the minimal value of $z(\mathbf{x},\mathbf{x}^{(k)})$ subject to the constraint $\{\mathbf{x}\ |\ A\mathbf{x}=\mathbf{b}\}$ is less than or equal to zero. Specifically, $z(\mathbf{x}^{(k+1)},\mathbf{x}^{(k)})\leq 0.$ As a result, by Cauchy-Schwarz inequality, we have

[TABLE]

which implies $\alpha^{(k+1)}\leq\alpha^{(k)}$. Since $\alpha^{(k)}\in[1,\sqrt{n}]$, the decreasing sequence $\{\alpha^{(k)}\}$ converges, i.e., $\alpha^{(k)}\rightarrow\alpha^{\ast}$. By the boundedness of $\{\mathbf{x}^{(k)}\}$, it has a convergent subsequence, i.e., there exists a vector $\mathbf{x}^{\ast}$ such that $\mathbf{x}^{(k_{i})}\rightarrow\mathbf{x}^{\ast}$. ∎

Remark 2.

The sufficient decrease property (Lemma 1) does not hold for $\beta=0$ when $L_{1}/L_{2}$ -A2 reduces to A1. So, we cannot show that A1 converges to a critical point.

Remark 3.

According to Theorem 1 and Theorem 2, for each algorithm either the sequence diverges due to unboundedness or it has a convergent subsequence. It is possible for the solution to be unbounded. For example, if $A$ has a zero column, then the corresponding entry can tend to $+\infty$ so that the ratio of $L_{1}$ and $L_{2}$ is minimized. In the numerical tests, we demonstrate empirically that $\{\mathbf{x}^{(k)}\}$ is always bounded and hence convergent for general (random) matrices $A$ .
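To illustrate the zero-column example, consider a hypothetical toy system (not taken from the paper) with $A=[1\ 0\ 0;\ 0\ 1\ 0]$ and $\mathbf{b}=(1,1)^{T}$ , so that every feasible point has the form $\mathbf{x}=(1,1,t)$ with $t$ unconstrained. The sketch below evaluates the ratio along this family and shows that it approaches its lower bound 1 only as $t$ grows without bound.

```python
import math

def ratio_with_free_entry(t):
    """L1/L2 ratio of x = (1, 1, t); t multiplies a zero column of the toy A."""
    x = [1.0, 1.0, t]
    return sum(abs(v) for v in x) / math.sqrt(sum(v * v for v in x))

# The ratio keeps decreasing toward 1 as the free entry grows.
values = [ratio_with_free_entry(t) for t in (1e1, 1e3, 1e6)]
for t, v in zip((1e1, 1e3, 1e6), values):
    print(f"t = {t:.0e}: ratio = {v:.6f}")
```

In this toy case the infimum of the ratio is not attained by any finite vector, which is the kind of unboundedness the remark refers to.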

V Numerical experiments

In this section, we compare the proposed algorithms with state-of-the-art methods in sparse recovery. All the numerical experiments are conducted on a desktop with CPU (Intel i7-6700, 3.4GHz) and $\mathrm{MATLAB\ 9.2\ (R2017a)}.$

We focus on the sparse recovery problem with highly coherent matrices, on which standard $L_{1}$ models fail. Following the works of [15, 49, 50], we consider an oversampled discrete cosine transform (DCT), defined as $A=[\mathbf{a}_{1},\mathbf{a}_{2},\cdots,\mathbf{a}_{n}]\in\mathds{R}^{m\times n}$ with

[TABLE]

where $\mathbf{w}$ is a random vector that is uniformly distributed in $[0,1]^{m}$ and $F\in\mathds{R}$ is a positive parameter that controls the coherence: a larger $F$ yields a more coherent matrix. Throughout the experiments, we consider oversampled DCT matrices of size $64\times 1024$ . The ground truth $\mathbf{x}\in\mathds{R}^{n}$ is simulated as an $s$ -sparse signal, where $s$ is the number of nonzero entries. As suggested in [50], we require a minimum separation of at least $2F$ in the support of $\mathbf{x}$ . As for the values of the nonzero elements, we follow the work of [51] and consider sparse signals with a high dynamic range. Define the dynamic range of a signal $\mathbf{x}$ as $\Theta(\mathbf{x})=\frac{\max\{|\mathbf{x}_{s}|\}}{\min\{|\mathbf{x}_{s}|\}},$ which can be controlled by an exponential factor $D$ . In particular, we simulate $\mathbf{x}_{s}$ by the following MATLAB command,

[TABLE]

In the experiments, we set $D=3$ and $5$ , corresponding to $\Theta\approx 10^{3}$ and $10^{5}$ , respectively. Note that randn and rand are the $\mathrm{MATLAB}$ commands for the Gaussian distribution $\mathcal{N}(0,1)$ and the uniform distribution $\mathcal{U}(0,1)$ , respectively. To compare with our previous work [32] on the $L_{1}/L_{2}$ minimization, we also consider the case where the nonzero elements follow the Gaussian distribution, i.e., $(\mathbf{x}_{s})_{i}\sim\mathcal{N}(0,1),i=1,2,\cdots,s.$
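The test setup described above can be sketched in Python with NumPy. Two ingredients are assumptions rather than quotations: the column form $\mathbf{a}_{j}=\frac{1}{\sqrt{m}}\cos(2\pi\mathbf{w}j/F)$ , which is the usual oversampled DCT construction, and the magnitudes $10^{D u}$ with $u\sim\mathcal{U}(0,1)$ and random signs, which is one construction consistent with $\Theta\approx 10^{D}$ . All function names are hypothetical.

```python
import numpy as np

def oversampled_dct(m, n, F, rng):
    """Columns a_j = cos(2*pi*w*j/F)/sqrt(m), j = 1..n, w ~ U[0,1]^m (assumed form)."""
    w = rng.random(m)
    j = np.arange(1, n + 1)
    return np.cos(2.0 * np.pi * np.outer(w, j) / F) / np.sqrt(m)

def separated_support(n, s, min_sep, rng):
    """Pick s sorted indices in [0, n) whose consecutive gaps are at least min_sep."""
    slack = n - (s - 1) * (min_sep - 1)
    if slack < s:
        raise ValueError("n is too small for this sparsity and separation")
    base = np.sort(rng.choice(slack, size=s, replace=False))
    # Shifting the i-th index by i*(min_sep-1) turns gaps >= 1 into gaps >= min_sep.
    return base + np.arange(s) * (min_sep - 1)

def dynamic_range_signal(n, s, F, D, rng):
    """s-sparse vector with random signs and magnitudes 10**(D*u), u ~ U(0,1) (assumed form)."""
    x = np.zeros(n)
    support = separated_support(n, s, 2 * F, rng)
    x[support] = np.sign(rng.standard_normal(s)) * 10.0 ** (D * rng.random(s))
    return x, support

def coherence(A):
    """Largest absolute normalized inner product between two distinct columns."""
    G = A / np.linalg.norm(A, axis=0)
    C = np.abs(G.T @ G)
    np.fill_diagonal(C, 0.0)
    return C.max()

rng = np.random.default_rng(0)
A_low = oversampled_dct(64, 1024, 1, rng)    # F = 1: low coherence
A_high = oversampled_dct(64, 1024, 20, rng)  # F = 20: high coherence
x, support = dynamic_range_signal(1024, 10, 20, 3, rng)
theta = np.abs(x[support]).max() / np.abs(x[support]).min()
print(coherence(A_low), coherence(A_high), theta)
```

Under these assumptions the dynamic range of any single draw is at most $10^{D}$ and approaches it as $s$ grows, and the coherence of the $F=20$ matrix is markedly larger than that of the $F=1$ matrix.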

The fidelity of sparse signal recovery is assessed in terms of success rate, defined as the number of successful trials over the total number of trials. When the relative error between the ground truth $\mathbf{x}$ and the reconstructed solution $\mathbf{x}^{\ast}$ , i.e., $\frac{\|\mathbf{x}^{\ast}-\mathbf{x}\|_{2}}{\|\mathbf{x}\|_{2}},$ is less than $10^{-3}$ , we declare it a success. Moreover, we categorize each failure to recover the ground-truth signal as either a model failure or an algorithm failure by comparing the objective function $f(\cdot)$ at the ground truth $\mathbf{x}$ and at the restored solution $\mathbf{x}^{\ast}$ . If $f(\mathbf{x})>f(\mathbf{x}^{\ast})$ , then $\mathbf{x}$ is not a global minimizer of the model, so we regard it as a model failure. If $f(\mathbf{x})<f(\mathbf{x}^{\ast})$ , then the algorithm does not reach a global minimizer, which is referred to as an algorithm failure. Similarly to success rates, we can define model-failure rates and algorithm-failure rates.
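The classification rule above can be sketched as a small helper. The function name `classify_trial`, the returned labels, and the handling of the tie $f(\mathbf{x})=f(\mathbf{x}^{\ast})$ (which the text does not cover) are illustrative assumptions; the objective defaults to the $L_{1}/L_{2}$ ratio.

```python
import math

def l1_over_l2(x):
    """Objective of the ratio model: ||x||_1 / ||x||_2."""
    return sum(abs(v) for v in x) / math.sqrt(sum(v * v for v in x))

def classify_trial(x_true, x_rec, f=l1_over_l2, tol=1e-3):
    """Label one trial as a success, a model failure, or an algorithm failure."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_rec, x_true)))
    rel_err = diff / math.sqrt(sum(b * b for b in x_true))
    if rel_err < tol:
        return "success"
    if f(x_true) > f(x_rec):
        # The ground truth is not a global minimizer of the model.
        return "model failure"
    if f(x_true) < f(x_rec):
        # The algorithm stopped at a point worse than the ground truth.
        return "algorithm failure"
    return "undetermined"  # equal objective values: not covered by the text

x_true = [1.0, 1.0, 0.0, 0.0]
print(classify_trial(x_true, [1.0, 1.0 + 1e-6, 0.0, 0.0]))  # success
print(classify_trial(x_true, [2.0, 0.0, 0.0, 0.0]))         # model failure
print(classify_trial(x_true, [1.0, 1.0, 1.0, 0.0]))         # algorithm failure
```

Rates are then obtained by counting each label over the trials and dividing by the total number of trials.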

V-A Algorithmic Comparison

We present various computational aspects of the proposed algorithms, i.e., BS, A1, and A2, together with a comparison to our previous ADMM approach [32]. First of all, we demonstrate the convergence of all proposed algorithms using an example of $s=15$ , $F=15$ (so the minimal separation is 30), and nonzero elements following the Gaussian distribution. Since the ratio model is solved via the $L_{1}$ - $\alpha L_{2}$ model, we plot the values of $\|\mathbf{x}^{(k)}\|_{1}-\alpha^{(k-1)}\|\mathbf{x}^{(k)}\|_{2}$ and $\alpha^{(k)}$ versus the iteration counter $k$ in Figure 1. For $L_{1}/L_{2}$ -BS, we record the value at each outer iteration and stop when either the number of outer iterations reaches 10 or $|\alpha^{(k)}-\alpha^{(k-1)}|\leq 10^{-2}$ . For each iteration of A1, A2, and the inner loop of BS, the stopping criterion is that the relative error satisfies $\|\mathbf{x}^{(k)}-\mathbf{x}^{(k-1)}\|_{2}/\|\mathbf{x}^{(k)}\|_{2}\leq 10^{-8}.$ The left plot in Figure 1 illustrates the convergence of the three algorithms in the sense that $\|\mathbf{x}^{(k)}\|_{1}-\alpha^{(k)}\|\mathbf{x}^{(k)}\|_{2}$ goes down. Both A1 and A2 are faster than BS because BS starts with a large range of $\alpha$ , namely $[1,\sqrt{n}]=[1,32]$ , while A1 and A2 start with a good initial value of $\alpha^{(0)}=\frac{\|\mathbf{x}^{(0)}\|_{1}}{\|\mathbf{x}^{(0)}\|_{2}}$ , which is very close to the final optimal value $\alpha^{\ast}$ . The right plot in Figure 1 examines the evolution of $\alpha^{(k)}$ , which gradually becomes stable and approaches a similar value around 3.06 for all three algorithms. Figure 1 confirms the decrease property of $\alpha^{(k)}$ proved in Lemma 1.
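The two stopping rules quoted above can be written down directly. This is only a sketch of the tests themselves, with hypothetical function names; it does not implement the BS, A1, or A2 updates.

```python
import math

def relative_change(x_new, x_old):
    """||x_new - x_old||_2 / ||x_new||_2 for vectors given as lists of floats."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_new, x_old)))
    den = math.sqrt(sum(a * a for a in x_new))
    return num / den

def inner_converged(x_new, x_old, tol=1e-8):
    """Stopping test for A1, A2, and the inner loop of BS."""
    return relative_change(x_new, x_old) <= tol

def bs_outer_done(alpha_new, alpha_old, outer_iter, max_outer=10, tol=1e-2):
    """Stopping test for the outer (bisection) loop of BS."""
    return outer_iter >= max_outer or abs(alpha_new - alpha_old) <= tol

n = 1024
alpha_interval = (1.0, math.sqrt(n))  # initial bisection interval for BS
print(alpha_interval)
```

With $n=1024$ the initial interval is $[1,32]$ , whereas the reported limit is around 3.06, which is consistent with the observation that A1 and A2 benefit from starting close to $\alpha^{\ast}$ .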

In Theorem 1, we require the sequence $\{\mathbf{x}^{(k)}\}$ to be bounded for the convergence analysis. Here we aim at an empirical verification of the boundedness. In particular, we test on various kinds of linear systems with $F\in\{1,20\}$ and sparsity ranging from 2 to 22. In each setting, we randomly generate 50 pairs of ground-truth signals and linear systems to compute the $L_{2}$ norm of the solutions obtained by A1 and A2, along with the $L_{2}$ norm of the ground-truth signals. The mean values of these $L_{2}$ norms are plotted in Figure 2. As the maximum values are finite, the reconstructed signal is always bounded. Figure 2 also shows that the $L_{2}$ norms of A1 and A2 align quite well with the ground truth when the sparsity is below 14, regardless of whether the system is coherent or not. When the matrix is highly coherent and the signal has more nonzero elements, both A1 and A2 give much larger values of the $L_{2}$ norm compared to the ground truth. This is because a larger $L_{2}$ norm gives rise to a smaller value of the ratio $L_{1}/L_{2}$ that we try to minimize. In any case, the solutions of both A1 and A2 are shown to be bounded.

Next, we compare the three algorithms with our previous ADMM approach [32]. We consider $F=1$ and $20$ with nonzero elements following the Gaussian distribution or having high dynamic ranges. We randomly simulate 50 trials for each sparsity level and compute the average of success rates, algorithm-failure rates, and computation time. The Gaussian case is illustrated in Figure 3, showing that ADMM is the worst in terms of success rates, partly due to high algorithm-failure rates. Here, $\rho_{1}=\rho_{2}=2000$ for ADMM and $\beta=1,\rho=20$ for A2. In addition, BS achieves the highest success rates but is the slowest. Both A1 and A2 have similar performance to BS with much reduced computation time. Figure 4 examines the case of high dynamic range for the nonzero values in $\mathbf{x}$ with $D=3$ and $5$ . Here we set $\beta=10^{-5}$ and $\rho=0.3$ for A2, while $\rho_{1}=\rho_{2}=100$ for ADMM. The performance is similar to that in the Gaussian case. In summary, we rate A1 as the most efficient algorithm for minimizing the ratio model, with a balanced performance between accuracy and computational cost. We also observe that all the algorithms tend to give better success rates with higher dynamic ranges, which seems counter-intuitive. We will revisit this phenomenon in Section VI.

V-B Model Comparison

We now compare various sparsity-promoting models. Since the Gaussian case was studied in our previous work [32], we focus on the dynamic range here. We compare the proposed $L_{1}/L_{2}$ model with the following models: $L_{1}$ [10], $L_{p}$ [11], $L_{1}$ - $L_{2}$ [49, 15], and TL1 [18]. We adopt $L_{1}/L_{2}$ -A1 to solve the ratio model, as it is the most efficient algorithm according to the discussion in Section V-A. The initial guess for all non-convex models is the $L_{1}$ solution obtained by Gurobi. We choose $p=1/2$ for $L_{p}$ and $a=10^{D-1}$ for TL1, assuming that the range factor $D$ is known a priori.

Figure 5 plots the success rates for $F=1,20$ and $D=3,5$ . We observe that TL1 is the best except for the case of low coherence and low dynamic range, where $L_{p}$ is the best; however, $L_{p}$ is the worst in the other cases. The $L_{1}/L_{2}$ model is always the second best. Note that the ratio model is parameter-free, while the performance of TL1 largely relies on the parameter $a$ . Figure 6 examines the success rate of TL1 with different values of $a$ . We choose $a=10^{D-1}$ in the model comparison, which is almost the best among the tested values of $a$ . If no such prior information on the dynamic range were available to tune $a$ , the performance of TL1 might be worse than that of $L_{1}/L_{2}$ .

VI Discussions

Candès and Wakin [52] presented two principles in compressed sensing (CS), i.e., sparsity and incoherence. We reported in our previous work [32] that higher coherence leads to better sparse recovery, which seems to contradict the current belief in CS. In this paper, we discuss the dynamic range and reveal its effect on the exact recovery via the $L_{1}$ approach. To the best of our knowledge, there has been little discussion on the dynamic range in the CS literature, except for [51]. We consider low-coherence matrices with $F=1$ and high-coherence ones with $F=20$ . We record the success rates for different combinations of sparsity levels ( $s=2:4:22$ ) and dynamic ranges ( $D=0:5$ ) in Table I. It shows that a higher dynamic range leads to better performance. It seems that the $L_{1}$ approach is independent of $D$ for relatively sparse signals.

Since three quantities may contribute to the success of sparse recovery, i.e., sparsity, coherence, and dynamic range, we try to give a comprehensive analysis by using the relative error $\|\mathbf{x}^{\ast}-\mathbf{x}\|_{2}/\|\mathbf{x}\|_{2}$ instead of the success rates, as the latter depend on the success threshold. We plot in Figure 7 the mean and the standard deviation of the relative errors from 50 random trials versus coherence levels ( $F=1,5,10,15,20$ ). Based on Table I, we only consider the number of nonzero values larger than 18 and $D\geq 3$ . In each subfigure of Figure 7, the curves decrease when $F$ increases, which means that higher coherence leads to better performance. This is consistent with the observation in [32]. As for the dynamic range, we discover in Figure 7 that a larger value of $D$ leads to a smaller relative error. Finally, the sparsity affects the performance in the sense that smaller relative errors can be achieved for sparser signals. These numerical phenomena have not been reported in the CS literature, which motivates future theoretical justification.

VII Conclusions and future works

We studied the scale-invariant and parameter-free minimization of $L_{1}/L_{2}$ to promote sparsity. We presented three numerical algorithms to minimize this nonconvex model based on the relationship between $L_{1}/L_{2}$ and $L_{1}$ - $\alpha L_{2}$ for a certain $\alpha$ . Experimental results compared the proposed algorithms with state-of-the-art methods in sparse recovery. In particular, the proposed algorithms work well when the ground-truth signal has a high dynamic range. Last but not least, we analyzed the behavior of the $L_{1}$ approach towards exact recovery with varying sparsity, coherence, and dynamic range. Future work includes the theoretical analysis of the effect of a high dynamic range on sparse recovery as well as applications of the ratio model in image processing such as blind deconvolution [28, 29].

-A Proof of Lemma 1

Proof.

Based on the $\mathbf{x}$ -subproblem in (11), we get

[TABLE]

After rearranging, we get the following inequality

[TABLE]

The second inequality follows from the convexity of the Euclidean norm and the definition of $\alpha^{(k)}$ . Lemma 1 is then obtained by dividing both sides of (33) by $\|\mathbf{x}^{(k+1)}\|_{2}$ . ∎

-B Proof of Lemma 2

Proof.

Simple calculations lead to

[TABLE]

For any $\mathbf{x}$ satisfying $A\mathbf{x}=\mathbf{b}$ , the minimal $L_{2}$ norm is reached by projecting the origin $\mathbf{0}$ onto the feasible set of $\{\mathbf{x}\ |\ A\mathbf{x}=\mathbf{b}\}.$ It follows from the projection operator defined in (15) that

[TABLE]

Combining (34) and (35), we get Lemma 2. ∎
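The minimal-norm step can be spelled out as follows. This is a sketch under the assumption that $A$ has full row rank, so that $AA^{T}$ is invertible; the symbol $\mathbf{x}_{\mathrm{LS}}$ is introduced here only for illustration.

```latex
% Assumption: A has full row rank, so A A^T is invertible.
\mathbf{x}_{\mathrm{LS}} := A^{T}(AA^{T})^{-1}\mathbf{b}, \qquad A\mathbf{x}_{\mathrm{LS}} = \mathbf{b}.
% Any feasible x can be written as x = x_LS + v with A v = 0, and
\langle \mathbf{x}_{\mathrm{LS}}, \mathbf{v}\rangle
  = \mathbf{b}^{T}(AA^{T})^{-1}A\mathbf{v} = 0,
% so the Pythagorean identity gives
\|\mathbf{x}\|_{2}^{2}
  = \|\mathbf{x}_{\mathrm{LS}}\|_{2}^{2} + \|\mathbf{v}\|_{2}^{2}
  \geq \|\mathbf{x}_{\mathrm{LS}}\|_{2}^{2}
  \quad \text{for all } \mathbf{x} \text{ with } A\mathbf{x}=\mathbf{b}.
```

Hence the $L_{2}$ norm of any feasible point is bounded below by the norm of the projection of the origin onto the feasible set.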

-C Proof of Lemma 3

Proof.

It is straightforward to have

[TABLE]

We simplify the first term in (36) by calculating

[TABLE]

and using $\|\mathbf{x}\|_{1}\leq\sqrt{n}\|\mathbf{x}\|_{2}$ . Therefore, we get

[TABLE]

As for the second term in (36), we have it bounded by

[TABLE]

Combining (37) and (38), we obtain (29). ∎

-D Proof of Lemma 4

Proof.

It is straightforward that

[TABLE]

By the optimality condition [47], the latter relation holds if and only if there exists a vector $\mathbf{s}$ such that

[TABLE]

which implies that $\mathbf{x}^{*}$ is a critical point of (26). It follows from (28) that (26) is equivalent to (2) and hence $\mathbf{x}^{*}$ is also a critical point of (2). According to the nonexpansiveness of the proximal operator and the Lipschitz continuity of $\nabla w$ , we have

[TABLE]

The Lemma follows. ∎

Bibliography


[1] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267–288, 1996.
[2] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, no. 1-4, pp. 259–268, 1992.
[3] A. Berman and R. J. Plemmons, Nonnegative matrices in the mathematical sciences. SIAM, 1994.
[4] D. L. Donoho et al., “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[5] E. J. Candès, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Comm. Pure Appl. Math, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM J. Comput., vol. 24, no. 2, pp. 227–234, 1995.
[7] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Asilomar Conf. Signals, Systems and Computers. IEEE, 1993, pp. 40–44.
[8] S. Chen, S. A. Billings, and W. Luo, “Orthogonal least squares methods and their application to non-linear system identification,” Int. J. Control, vol. 50, no. 5, pp. 1873–1896, 1989.
