Sparse Solutions of a Class of Constrained Optimization Problems

Lei Yang; Xiaojun Chen; Shuhuang Xiang

arXiv:1907.00880·math.OC·October 1, 2021·Math. Oper. Res.

Open Access

TL;DR

This paper investigates properties of solutions to a class of sparse optimization problems involving nonconvex and non-Lipschitz objectives, providing bounds, solution set characterizations, and an algorithm with convergence guarantees.

Contribution

It offers new theoretical insights into the structure of solutions for sparse optimization with nonconvex penalties and proposes an effective smoothing penalty method for solving such problems.

Findings

01

Optimal solutions are on the boundary of the feasible set when 0<p<1.

02

The solution set for 0<p<1 is finite for q in {1,∞}.

03

Under mild conditions, any cluster point of the sequence generated by the proposed smoothing penalty method is a stationary point of the constrained problem.

Abstract

In this paper, we consider a well-known sparse optimization problem that aims to find a sparse solution of a possibly noisy underdetermined system of linear equations. Mathematically, it can be modeled in a unified manner by minimizing $∥ x ∥_{p}^{p}$ subject to $∥ A x - b ∥_{q} \leq σ$ for given $A \in R^{m \times n}$ , $b \in R^{m}$ , $σ \geq 0$ , $0 \leq p \leq 1$ and $q \geq 1$ . We then study various properties of the optimal solutions of this problem. Specifically, without any condition on the matrix $A$ , we provide upper bounds in cardinality and infinity norm for the optimal solutions, and show that all optimal solutions must be on the boundary of the feasible set when $0 < p < 1$ . Moreover, for $q \in {1, \infty}$ , we show that the problem with $0 < p < 1$ has a finite number of optimal solutions and prove that there exists $0 < p^{*} < 1$ such that the solution…

Equations

$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}:=\sum_{i=1}^{n}|x_{i}|^{p}\quad\mathrm{s.t.}\quad\|A\bm{x}-\bm{b}\|_{q}\leq\sigma,$$

$$\frac{(\|\bm{b}\|_{q}-\sigma)\,m^{\min\{\frac{1}{2}-\frac{1}{q},\,0\}}}{|\mathcal{J}|\sqrt{\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}}\leq\|\bm{x}^{*}\|_{\infty}\leq\frac{\sigma\,m^{\max\{\frac{1}{2}-\frac{1}{q},\,0\}}+\|\bm{b}\|_{2}}{\sqrt{\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}},$$

$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|A\bm{x}-\bm{b}\|_{q}^{q}+\lambda\|\bm{x}\|_{p}^{p},$$

$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}+\lambda\left(\|A\bm{x}-\bm{b}\|_{q}^{q}-\sigma^{q}\right)_{+}.$$

$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}\quad\mathrm{s.t.}\quad\|A\bm{x}-\bm{b}\|_{1}\leq\sigma,$$

$$\partial f(\bm{x}):=\left\{\bm{d}\in\mathbb{R}^{n}:\exists\,\bm{x}^{k}\xrightarrow{f}\bm{x},\ \bm{d}^{k}\to\bm{d}\ \mathrm{with}\ \bm{d}^{k}\in\hat{\partial}f(\bm{x}^{k})\right\},$$

$$\partial^{\infty}f(\bm{x}):=\left\{\bm{d}\in\mathbb{R}^{n}:\exists\,\bm{x}^{k}\xrightarrow{f}\bm{x},\ \lambda_{k}\bm{d}^{k}\to\bm{d},\ \lambda_{k}\downarrow 0\ \mathrm{with}\ \bm{d}^{k}\in\hat{\partial}f(\bm{x}^{k})\right\},$$

$$n^{\min\{\frac{1}{q}-\frac{1}{2},\,0\}}\|\bm{x}\|_{2}\leq\|\bm{x}\|_{q}\leq n^{\max\{\frac{1}{q}-\frac{1}{2},\,0\}}\|\bm{x}\|_{2}.$$

$$\|\bm{x}\|_{q}^{q}=\sum_{i=1}^{n}|x_{i}|^{q}=\sum_{i=1}^{n}|x_{i}|^{q}\cdot 1\leq\left(\sum_{i=1}^{n}(|x_{i}|^{q})^{\frac{2}{q}}\right)^{\frac{q}{2}}\left(\sum_{i=1}^{n}1^{\frac{2}{2-q}}\right)^{1-\frac{q}{2}}=n^{1-\frac{q}{2}}\|\bm{x}\|_{2}^{q},$$

$$\|\bm{x}\|_{2}^{2}=\sum_{i=1}^{n}|x_{i}|^{2}=\sum_{i=1}^{n}|x_{i}|^{2}\cdot 1\leq\left(\sum_{i=1}^{n}(|x_{i}|^{2})^{\frac{q}{2}}\right)^{\frac{2}{q}}\left(\sum_{i=1}^{n}1^{\frac{q}{q-2}}\right)^{1-\frac{2}{q}}=n^{1-\frac{2}{q}}\|\bm{x}\|_{q}^{2},$$

$$\tau:=\min_{h_{i}\neq 0,\,i\in\mathcal{J}}\left\{\frac{x_{i}^{*}}{h_{i}}\right\}=\frac{x_{i_{0}}^{*}}{h_{i_{0}}}\quad\text{for some } i_{0}.$$

$$\bm{x}_{\mathcal{J}}^{*}+t\bm{h}_{\mathcal{J}}\neq 0,\quad\text{and}\quad\mathrm{sgn}(x_{i}^{*})=\mathrm{sgn}(x_{i}^{*}+th_{i})\ \text{ for } i\in\mathcal{J}.$$

$$f(0)=\cdots=\min_{t\in[-t_{0},t_{0}]}\sum_{i\in\mathcal{J}}\left[\mathrm{sgn}(x_{i}^{*}+th_{i})(x_{i}^{*}+th_{i})\right]^{p}=\min_{t\in[-t_{0},t_{0}]}f(t),$$

$$f^{\prime\prime}(t)=p(p-1)\sum_{i\in\mathcal{J}}\left[\mathrm{sgn}(x_{i}^{*})(x_{i}^{*}+th_{i})\right]^{p-2}h_{i}^{2}<0.$$

$$\sigma\geq\cdots\geq m^{\min\{\frac{1}{q}-\frac{1}{2},\,0\}}\left(\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{2}-\|\bm{b}\|_{2}\right)\geq m^{\min\{\frac{1}{q}-\frac{1}{2},\,0\}}\left(\sqrt{\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}\,\|\bm{x}_{\mathcal{J}}^{*}\|_{2}-\|\bm{b}\|_{2}\right),$$

$$\|\bm{x}^{*}\|_{\infty}\leq\|\bm{x}^{*}\|_{2}=\|\bm{x}_{\mathcal{J}}^{*}\|_{2}\leq\frac{\sigma\,m^{\max\{\frac{1}{2}-\frac{1}{q},\,0\}}+\|\bm{b}\|_{2}}{\sqrt{\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}},$$

$$\sigma\geq\cdots\geq\|\bm{b}\|_{q}-m^{\max\{\frac{1}{q}-\frac{1}{2},\,0\}}\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{2}\geq\|\bm{b}\|_{q}-m^{\max\{\frac{1}{q}-\frac{1}{2},\,0\}}\sqrt{\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}\,\|\bm{x}_{\mathcal{J}}^{*}\|_{2},$$

$$\|\bm{x}^{*}\|_{\infty}=\|\bm{x}_{\mathcal{J}}^{*}\|_{\infty}\geq\frac{\|\bm{x}_{\mathcal{J}}^{*}\|_{2}}{|\mathcal{J}|}\geq\frac{(\|\bm{b}\|_{q}-\sigma)\,m^{\min\{\frac{1}{2}-\frac{1}{q},\,0\}}}{|\mathcal{J}|\sqrt{\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}},$$

$$\forall\,\bm{x},\,\bm{y}\in\mathbb{P}_{j}\ \Longrightarrow\ x_{i}y_{i}\geq 0\quad\text{for } i=1,\cdots,n.$$

$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}\quad\mathrm{s.t.}\quad\bm{x}\in\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$$

$$\|\bm{x}^{*}\|_{p}^{p}=\cdots\geq\sum_{j=1}^{n}\left(\lambda|y_{j}|^{p}+(1-\lambda)|z_{j}|^{p}\right)=\lambda\|\bm{y}\|_{p}^{p}+(1-\lambda)\|\bm{z}\|_{p}^{p}\geq\|\bm{x}^{*}\|_{p}^{p},$$

$${\rm EXT}\left(\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)\right):=\left\{\text{all extreme points of }\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)\right\}.$$

$${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq\bigcup_{j\in\{1,\cdots,2^{n}\}}{\rm EXT}\left(\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)\right).$$

$$\bigcup_{0<p<1}{\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq\bigcup_{j\in\{1,\cdots,2^{n}\}}{\rm EXT}\left(\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)\right),$$

$$a_{1}\leq a_{2}\leq\cdots\leq a_{n},\quad b_{1}\leq b_{2}\leq\cdots\leq b_{n},\quad\sum_{j=1}^{n}a_{j}^{k}=\sum_{j=1}^{n}b_{j}^{k},\quad k=1,\cdots,n,$$

$$\Delta_{k}(\bm{a},\bm{b}):=\sum_{j=1}^{s}\left((\ln|a_{i_{j}}|)^{k}-(\ln|b_{t_{j}}|)^{k}\right).$$


Taxonomy

Topics: Optimization and Variational Analysis · Sparse and Compressive Sensing Techniques · Advanced Optimization Algorithms Research

Full text

Lei Yang, Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076.

Xiaojun Chen, Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China.

Shuhuang Xiang, School of Mathematics and Statistics, INP-LAMA, Central South University, Changsha, Hunan 410083, China.

Abstract

In this paper, we consider a well-known sparse optimization problem that aims to find a sparse solution of a possibly noisy underdetermined system of linear equations. Mathematically, it can be modeled in a unified manner by minimizing $\|\bm{x}\|_{p}^{p}$ subject to $\|A\bm{x}-\bm{b}\|_{q}\leq\sigma$ for given $A\in\mathbb{R}^{m\times n}$ , $\bm{b}\in\mathbb{R}^{m}$ , $\sigma\geq 0$ , $0\leq p\leq 1$ and $q\geq 1$ . We then study various properties of the optimal solutions of this problem. Specifically, without any condition on the matrix $A$ , we provide upper bounds in cardinality and infinity norm for the optimal solutions, and show that all optimal solutions must be on the boundary of the feasible set when $0<p\leq 1$ . Moreover, for $q\in\{1,\infty\}$ , we show that the problem with $0<p<1$ has a finite number of optimal solutions and prove that there exists $0<p^{*}<1$ such that the solution set of the problem with any $0<p<p^{*}$ is contained in the solution set of the problem with $p=0$ and there further exists $0<\overline{p}<p^{*}$ such that the solution set of the problem with any $0<p\leq\overline{p}$ remains unchanged. An estimation of such $p^{*}$ is also provided. In addition, to solve the constrained nonconvex non-Lipschitz $L_{p}$ - $L_{1}$ problem ( $0<p<1$ and $q=1$ ), we propose a smoothing penalty method and show that, under some mild conditions, any cluster point of the sequence generated is a stationary point of our problem. Some numerical examples are given to implicitly illustrate the theoretical results and show the efficiency of the proposed algorithm for the constrained $L_{p}$ - $L_{1}$ problem under different noises.

Keywords

Sparse optimization; nonconvex non-Lipschitz optimization; cardinality minimization; penalty method; smoothing approximation.

MSC subject classification: Primary: 90C26, 90C30; secondary: 65K05

OR/MS subject classification: Primary: mathematics, systems solution; secondary: programming, algorithms

1 Introduction

In this paper, we consider a class of sparse optimization problems, which can be modeled in a unified manner as the following constrained $L_{p}$ - $L_{q}$ problem:

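$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}:=\sum_{i=1}^{n}|x_{i}|^{p}\quad\mathrm{s.t.}\quad\|A\bm{x}-\bm{b}\|_{q}\leq\sigma, \tag{1}$$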

where $A\in\mathbb{R}^{m\times n}$ , $\bm{b}\in\mathbb{R}^{m}$ , $\sigma\geq 0$ , $0\leq p\leq 1$ and $1\leq q\leq\infty$ are given. We assume that the feasible set of problem (1) is nonempty so that problem (1) is well-defined. With this assumption, one can easily verify that an optimal solution for $p=0$ (namely, a sparsest solution) exists thanks to the discrete and discontinuous nature of $\|\cdot\|_{0}$ and the closedness of the feasible set. Moreover, for $0<p\leq 1$ , since $\|\bm{x}\|_{p}^{p}$ is level-bounded, then an optimal solution exists (see [33, Theorem 1.9]). Therefore, the optimal solution set of problem (1), denoted by ${\rm SOL}(A,\bm{b},\sigma,p,q)$ , is nonempty for any $0\leq p\leq 1$ and $1\leq q\leq\infty$ . We also assume that $\|\bm{b}\|_{q}>\sigma$ so that $A\neq 0$ and $0\not\in{\rm SOL}(A,\bm{b},\sigma,p,q)$ . Obviously, when $p=1$ , (1) is a convex optimization problem and when $0<p<1$ , (1) yields a nonconvex and non-Lipschitz optimization problem.
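To make the ingredients of problem (1) concrete, the following is a minimal sketch that evaluates the objective $\|\bm{x}\|_{p}^{p}$ and tests membership in the feasible set. The function names and the toy data are illustrative assumptions and do not come from the paper.

```python
import math

def lp_objective(x, p):
    """Return ||x||_p^p for 0 <= p <= 1; p = 0 counts the nonzero entries."""
    if p == 0:
        return float(sum(1 for xi in x if xi != 0))
    return sum(abs(xi) ** p for xi in x)

def residual_norm(A, x, b, q):
    """Return ||Ax - b||_q for 1 <= q <= inf (pass math.inf for the max norm)."""
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    if q == math.inf:
        return max(abs(ri) for ri in r)
    return sum(abs(ri) ** q for ri in r) ** (1.0 / q)

def is_feasible(A, x, b, sigma, q, tol=1e-12):
    """Check ||Ax - b||_q <= sigma up to a small floating-point tolerance."""
    return residual_norm(A, x, b, q) <= sigma + tol

# Hypothetical toy data: one equation, two unknowns, noise level 0.5.
A = [[1.0, 2.0]]
b = [2.0]
x = [0.0, 0.75]
print(lp_objective(x, 0.5), residual_norm(A, x, b, 1), is_feasible(A, x, b, 0.5, 1))
```

In this toy instance $\|\bm{b}\|_{q}=2>\sigma=0.5$, so the origin is infeasible, in line with the blanket assumption $\|\bm{b}\|_{q}>\sigma$.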

Problem (1) aims to find a sparse vector $\bm{x}$ from the corrupted observation $\bm{b}=A\bm{x}+\bm{\xi}$ , where $\bm{\xi}$ denotes an unknown noisy vector bounded by $\sigma$ (the noise level) in $L_{q}$ -norm, i.e., $\|\bm{\xi}\|_{q}\leq\sigma$ . This problem arises in many contemporary applications and has been widely studied under different choices of $p$ , $q$ and $\sigma$ in the literature; see, for example, [3, 4, 5, 6, 7, 8, 12, 13, 15, 16, 17, 18, 20, 23, 31, 34, 35, 41, 42, 43]. Among these studies, the $L_{2}$ -norm is commonly used for measuring the noise and leads to a mathematically tractable problem when the noise exists and comes from a Gaussian distribution [3, 5, 12, 13, 17, 20, 34, 35]. In particular, it has been known that a sparse vector can be (approximately) recovered by the solution of the convex optimization problem (1) with $p=1$ and $q=2$ under some well-known recovery conditions such as the restricted isometry property (RIP) [5], the mutual coherence condition [3, 17] and the null space property (NSP) [15, 41]. Such convex constrained $L_{1}$ - $L_{2}$ problem can also be solved efficiently by a spectral projected gradient $L_{1}$ minimization algorithm (SPGL1) proposed by Van den Berg and Friedlander [35]. On the other hand, it is natural to find a sparse vector by solving problem (1) with $0<p<1$ since $\|\bm{x}\|_{p}^{p}$ approaches $\|\bm{x}\|_{0}$ as $p\to 0$ . Indeed, under certain RIP conditions, Foucart and Lai [20] showed that a sparse vector can be (approximately) recovered by the solution of the nonconvex non-Lipschitz problem (1) with $0<p<1$ and $q=2$ . Chen, Lu and Pong [12] also proposed a penalty method for solving this constrained $L_{p}$ - $L_{2}$ problem ( $0<p<1$ ) with promising numerical performances. Later, this penalty method and the SPGL1 are further combined to solve (1) with $0<p<1$ and $q=2$ for recovering sparse signals on the sphere in [13]. 
However, when the noise does not come from the Gaussian distribution but other heavy-tailed distributions (e.g., Student’s t-distribution) or contains outliers, using $\|A\bm{x}-\bm{b}\|_{2}$ as the data fitting term is no longer appropriate. In this case, some robust loss functions such as the $L_{1}$ -norm [19, 36, 37] and the $L_{\infty}$ -norm [4, 7] are used to develop robust models. Recently, Zhao, Jiang and Luo [43] also established a fairly comprehensive weak stability theory for problem (1) with $p=1$ and $q\in\{1,2,\infty\}$ under a so-called weak range space property (RSP) condition. The weak RSP condition can be induced by several existing compressed sensing matrix properties and hence can be the mildest one for the sparse solution recovery. However, it is still not easy to verify this condition in practice.

In this paper, we focus on problem (1) with different choices of $p$ and $q$ , and establish the following theoretical results concerning its optimal solutions without any condition on the sensing matrix $A$ .

(i)

For any $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,p,q)$ with $0\leq p<1$ and $1\leq q\leq\infty$ , we have $\|\bm{x}^{*}\|_{0}=\mathrm{rank}(A_{\mathcal{J}})$ and

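$$\frac{(\|\bm{b}\|_{q}-\sigma)\,m^{\min\{\frac{1}{2}-\frac{1}{q},\,0\}}}{|\mathcal{J}|\sqrt{\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}}\leq\|\bm{x}^{*}\|_{\infty}\leq\frac{\sigma\,m^{\max\{\frac{1}{2}-\frac{1}{q},\,0\}}+\|\bm{b}\|_{2}}{\sqrt{\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}},$$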

where ${\mathcal{J}}=\mathrm{supp}(\bm{x}^{*})$ , $|\mathcal{J}|$ denotes its cardinality, and $\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})$ and $\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})$ are the largest and smallest eigenvalues of $A_{\mathcal{J}}^{\top}A_{\mathcal{J}}$ , respectively. Moreover, for any $1\leq q\leq\infty$ , $\|A\bm{x}^{*}-\bm{b}\|_{q}=\sigma$ for $0<p\leq 1$ ; and $\|A(\alpha\bm{x}^{*})-\bm{b}\|_{q}=\sigma$ with some $\alpha\in(0,1]$ for $p=0$ .

(ii)

For $q\in\{1,\infty\}$ , the solution set SOL $(A,\bm{b},\sigma,p,q)$ with $0<p<1$ has a finite number of elements.

(iii)

There exists a $p^{*}\in(0,\,1]$ such that ${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq{\rm SOL}(A,\bm{b},\sigma,0,1)$ for any $p\in(0,\,p^{*})$ . An explicit estimation of such $p^{*}\in(0,1]$ is also given. Moreover, there exists a $\overline{p}\in(0,\,p^{*})$ such that ${\rm SOL}(A,\bm{b},\sigma,p,1)={\rm SOL}(A,\bm{b},\sigma,\overline{p},1)$ for any $p\in(0,\,\overline{p}]$ .

Here, we would like to point out that the sparse solution recovery result (iii) is developed without any aforementioned recovery condition on $A$ . This not only complements the existing recovery results in the literature, but also shows the potential advantage of using the $L_{p}$ -norm ( $0<p<1$ ) for recovering the sparse solution over the $L_{1}$ -norm ball.

Note that problem (1) is a constrained problem, while, in statistics and computer science, the $L_{p}$ - $L_{q}$ problem/minimization often refers to the following unconstrained regularized problem [9, 11, 14]:

$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|A\bm{x}-\bm{b}\|_{q}^{q}+\lambda\|\bm{x}\|_{p}^{p}, \tag{2}$$

where $\lambda$ is a positive regularization parameter. Indeed, when $p=1$ and $q=2$ , problem (2) is the well-known $L_{1}$ -regularized least-squares problem (namely, the LASSO problem) and it is known that, in this case, there exists a $\bar{\lambda}>0$ such that, for $\lambda\geq\bar{\lambda}$ , the constrained problem (1) is equivalent to the unconstrained problem (2) regarding solutions; see, for example, [3, Section 3.2.3]. However, Example 3.1 in [12] shows that for $0<p<1$ and $q=2$ , there does not exist a $\lambda$ so that problems (1) and (2) have a common global or local minimizer. Hence, for $0<p<1$ , one cannot expect to solve (1) by solving the regularized problem (2) with some fixed $\lambda>0$ . In view of this, we shall consider a penalty method for solving problem (1) with $0<p<1$ , which basically solves the constrained problem (1) by solving a sequence of unconstrained penalty problems. Specifically, we consider the following penalty problem of (1):

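$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}+\lambda\left(\|A\bm{x}-\bm{b}\|_{q}^{q}-\sigma^{q}\right)_{+}. \tag{3}$$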

Note that the function $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|^{q}_{q}$ is continuously differentiable for $1<q<\infty$ . Then, based on problem (3), one can readily extend the penalty method proposed in [12] for solving problem (1) with $0<p<1$ and $q=2$ to solve problem (1) with $0<p<1$ and $1<q<\infty$ . However, for $q\in\{1,\infty\}$ , the function $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|_{q}$ is nonsmooth, so the approach in [12] cannot be adapted directly. In view of this, in this paper, we propose an alternative smoothing penalty method for solving

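$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}\quad\mathrm{s.t.}\quad\|A\bm{x}-\bm{b}\|_{1}\leq\sigma, \tag{4}$$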

where $0<p<1$ . Notice that we omit the case of $q=\infty$ to save space in this paper. Nevertheless, our approach can be extended without much difficulty to solve problem (1) with $0<p<1$ and $q=\infty$ , because the $L_{1}$ -constrained problem and the $L_{\infty}$ -constrained problem have similar properties in the sense that both constraints $\|A\bm{x}-\bm{b}\|_{1}\leq\sigma$ and $\|A\bm{x}-\bm{b}\|_{\infty}\leq\sigma$ can be represented as linear constraints, and the functions $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|_{1}$ and $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|_{\infty}$ are piecewise linear. We shall show that problem (3) with $q=1$ is the exact penalty problem of problem (4) regarding local minimizers and global minimizers. We also prove that any cluster point of a sequence generated by our smoothing penalty method is a stationary point of problem (4). Moreover, some numerical results are reported to show that all computed stationary points have the properties in our theoretical contribution (i) mentioned above. Here, we would like to emphasize that finding a global optimal solution of (4) is NP-hard [11, 21]. Thus, it is interesting to see that our smoothing penalty method can efficiently find a ‘good’ stationary point of problem (4), which has important properties of a global optimal solution of problem (4).
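As a rough illustration of the penalty problem (3), the sketch below only evaluates the nonsmooth penalty objective for finite $q$; it is not the smoothing penalty method itself, and the function name and toy data are assumptions made for this example.

```python
def penalty_objective(A, x, b, sigma, p, q, lam):
    """Evaluate ||x||_p^p + lam * max(||Ax - b||_q^q - sigma^q, 0) for finite q >= 1."""
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    fit = sum(abs(ri) ** q for ri in r)          # ||Ax - b||_q^q
    sparsity = sum(abs(xi) ** p for xi in x)     # ||x||_p^p
    return sparsity + lam * max(fit - sigma ** q, 0.0)

# Hypothetical toy data (not from the paper).
A = [[1.0, 2.0]]
b = [2.0]
sigma, p, q, lam = 0.5, 0.5, 1, 10.0

inside = penalty_objective(A, [0.0, 0.75], b, sigma, p, q, lam)   # feasible point
outside = penalty_objective(A, [0.0, 0.0], b, sigma, p, q, lam)   # infeasible point
print(inside, outside)
```

At a feasible point the penalty term vanishes and the value reduces to $\|\bm{x}\|_{p}^{p}$, while an infeasible point pays $\lambda$ times its constraint violation.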

The rest of this paper is organized as follows. In Section 2, we rigorously prove properties (i)-(iii) listed above and give a concrete example to verify these properties. In Section 3, we present a smoothing penalty method for solving problem (4) and show some convergence results. Some numerical results are presented in Section 4, with some concluding remarks given in Section 5.

Notation and Preliminaries

In this paper, we use the convention that $\frac{1}{\infty}=0$ . For an index set $\mathcal{J}\subseteq\{1,\cdots,n\}$ , let $|\mathcal{J}|$ denote its cardinality and $\mathcal{J}^{c}$ denote its complement. We denote by $\bm{x}_{\mathcal{J}}\in\mathbb{R}^{|\mathcal{J}|}$ the subvector formed from a vector $\bm{x}\in\mathbb{R}^{n}$ by picking the entries indexed by $\mathcal{J}$ and denote by $A_{\mathcal{J}}\in\mathbb{R}^{m\times|\mathcal{J}|}$ the submatrix formed from a matrix $A\in\mathbb{R}^{m\times n}$ by picking the columns indexed by $\mathcal{J}$ . Recall from [33, Definition 8.3] that, for a proper closed function $f$ , the regular (or Fréchet) subdifferential, the (limiting) subdifferential and the horizon subdifferential of $f$ at $\bm{x}\in{\rm dom}\,f$ are defined respectively as

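$$\begin{aligned}\hat{\partial}f(\bm{x})&:=\left\{\bm{d}\in\mathbb{R}^{n}:\liminf_{\bm{y}\to\bm{x},\,\bm{y}\neq\bm{x}}\frac{f(\bm{y})-f(\bm{x})-\langle\bm{d},\,\bm{y}-\bm{x}\rangle}{\|\bm{y}-\bm{x}\|_{2}}\geq 0\right\},\\ \partial f(\bm{x})&:=\left\{\bm{d}\in\mathbb{R}^{n}:\exists\,\bm{x}^{k}\xrightarrow{f}\bm{x},\ \bm{d}^{k}\to\bm{d}\ \mathrm{with}\ \bm{d}^{k}\in\hat{\partial}f(\bm{x}^{k})\right\},\\ \partial^{\infty}f(\bm{x})&:=\left\{\bm{d}\in\mathbb{R}^{n}:\exists\,\bm{x}^{k}\xrightarrow{f}\bm{x},\ \lambda_{k}\bm{d}^{k}\to\bm{d},\ \lambda_{k}\downarrow 0\ \mathrm{with}\ \bm{d}^{k}\in\hat{\partial}f(\bm{x}^{k})\right\},\end{aligned}$$

where $\bm{x}^{k}\xrightarrow{f}\bm{x}$ means that $\bm{x}^{k}\to\bm{x}$ and $f(\bm{x}^{k})\to f(\bm{x})$ .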

It can be observed from the above definitions (or see [33, Proposition 8.7]) that

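$$\begin{aligned}&\left\{\bm{d}\in\mathbb{R}^{n}:\exists\,\bm{x}^{k}\xrightarrow{f}\bm{x},\ \bm{d}^{k}\to\bm{d}\ \mathrm{with}\ \bm{d}^{k}\in\partial f(\bm{x}^{k})\right\}\subseteq\partial f(\bm{x}),\\ &\left\{\bm{d}\in\mathbb{R}^{n}:\exists\,\bm{x}^{k}\xrightarrow{f}\bm{x},\ \lambda_{k}\bm{d}^{k}\to\bm{d},\ \lambda_{k}\downarrow 0\ \mathrm{with}\ \bm{d}^{k}\in\partial f(\bm{x}^{k})\right\}\subseteq\partial^{\infty}f(\bm{x}).\end{aligned}$$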

When $f$ is convex, the above (limiting) subdifferential coincides with the classical subdifferential in convex analysis [33, Proposition 8.12]. Moreover, if $f$ is continuously differentiable, we have $\partial f(\bm{x})=\{\nabla f(\bm{x})\}$ , where $\nabla f(\bm{x})$ is the gradient of $f$ at $\bm{x}$ [33, Exercise 8.8(b)]. For a closed set $\mathcal{X}\subseteq\mathbb{R}^{n}$ , its indicator function $\delta_{\mathcal{X}}$ is defined by $\delta_{\mathcal{X}}(\bm{x})=0$ if $\bm{x}\in\mathcal{X}$ and $\delta_{\mathcal{X}}(\bm{x})=+\infty$ otherwise. In addition, we use $\mathcal{B}(\bm{y};\delta)$ to denote the closed ball of radius $\delta$ centered at $\bm{y}$ , i.e., $\mathcal{B}(\bm{y};\delta):=\{\bm{x}\in\mathbb{R}^{n}:\|\bm{x}-\bm{y}\|_{2}\leq\delta\}$ , and ${\rm FEA}(A,\bm{b},\sigma,q):=\left\{\bm{x}\in\mathbb{R}^{n}:\|A\bm{x}-\bm{b}\|_{q}\leq\sigma\right\}$ to denote the feasible set of problem (1).

2 Properties of solutions of problem (1)

In this section, we characterize the properties of the optimal solutions of problem (1) with different choices of $p$ and $q$ . We first give a supporting lemma.

Lemma 2.1

Let $1\leq q\leq\infty$ . For any $\bm{x}\in\mathbb{R}^{n}$ , we have

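$$n^{\min\{\frac{1}{q}-\frac{1}{2},\,0\}}\|\bm{x}\|_{2}\leq\|\bm{x}\|_{q}\leq n^{\max\{\frac{1}{q}-\frac{1}{2},\,0\}}\|\bm{x}\|_{2}.$$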

Proof. We consider the following two cases.

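• $1\leq q\leq 2$ . In this case, it is easy to see that $\|\bm{x}\|_{q}\geq\|\bm{x}\|_{2}$ . On the other hand, since $2/q\geq 1$ , it follows from Hölder's inequality that

$$\|\bm{x}\|_{q}^{q}=\sum_{i=1}^{n}|x_{i}|^{q}=\sum_{i=1}^{n}|x_{i}|^{q}\cdot 1\leq\left(\sum_{i=1}^{n}(|x_{i}|^{q})^{\frac{2}{q}}\right)^{\frac{q}{2}}\left(\sum_{i=1}^{n}1^{\frac{2}{2-q}}\right)^{1-\frac{q}{2}}=n^{1-\frac{q}{2}}\|\bm{x}\|_{2}^{q},$$

which results in $\|\bm{x}\|_{q}\leq n^{\frac{1}{q}-\frac{1}{2}}\|\bm{x}\|_{2}$ .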

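• $q\geq 2$ . In this case, it is easy to see that $\|\bm{x}\|_{q}\leq\|\bm{x}\|_{2}$ . On the other hand, since $q/2\geq 1$ , it follows from Hölder's inequality that

$$\|\bm{x}\|_{2}^{2}=\sum_{i=1}^{n}|x_{i}|^{2}=\sum_{i=1}^{n}|x_{i}|^{2}\cdot 1\leq\left(\sum_{i=1}^{n}(|x_{i}|^{2})^{\frac{q}{2}}\right)^{\frac{2}{q}}\left(\sum_{i=1}^{n}1^{\frac{q}{q-2}}\right)^{1-\frac{2}{q}}=n^{1-\frac{2}{q}}\|\bm{x}\|_{q}^{2},$$

which results in $\|\bm{x}\|_{q}\geq n^{\frac{1}{q}-\frac{1}{2}}\|\bm{x}\|_{2}$ .

Combining the above results proves the lemma.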
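The two-sided bound of Lemma 2.1 can be spot-checked numerically. The sketch below is only a sanity check on random vectors, not a proof; the helper names and the tolerance are assumptions of this example, and $q=\infty$ is handled through the convention $\frac{1}{\infty}=0$.

```python
import math
import random

def norm(x, q):
    """Return ||x||_q for 1 <= q <= inf."""
    if q == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** q for xi in x) ** (1.0 / q)

def lemma_bounds(x, q):
    """Return the lower and upper bounds on ||x||_q stated in Lemma 2.1."""
    n = len(x)
    inv_q = 0.0 if q == math.inf else 1.0 / q   # convention 1/inf = 0
    e = inv_q - 0.5
    two = norm(x, 2)
    return n ** min(e, 0.0) * two, n ** max(e, 0.0) * two

def check(x, q, tol=1e-9):
    """True if ||x||_q lies between the two bounds, up to rounding."""
    lo, hi = lemma_bounds(x, q)
    return lo - tol <= norm(x, q) <= hi + tol

random.seed(0)
samples = [[random.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(100)]
print(all(check(x, q) for x in samples for q in (1, 1.5, 2, 3, math.inf)))
```

The all-ones vector attains the upper bound for $q\leq 2$ and the lower bound for $q\geq 2$, so the constants cannot be improved in general.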

The following theorem is given for $0\leq p\leq 1$ and $1\leq q\leq\infty$ .

Theorem 2.2

Let $1\leq q\leq\infty$ . For any $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,p,q)$ , the following statements hold with ${\mathcal{J}}:=\mathrm{supp}(\bm{x}^{*})$ .

(i)

For $0<p\leq 1$ , $\|A\bm{x}^{*}-\bm{b}\|_{q}=\sigma$ ; and for $p=0$ , there is a scalar $\alpha\in(0,1]$ such that $\|A(\alpha\bm{x}^{*})-\bm{b}\|_{q}=\sigma$ and $\alpha^{\prime}\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,0,q)$ for any $\alpha^{\prime}\in[\alpha,\,1]$ .

(ii)

For $0\leq p<1$ , $\|\bm{x}^{*}\|_{0}=|\mathcal{J}|=\mathrm{rank}(A_{\mathcal{J}})$ .

(iii)

For $0\leq p<1$ ,

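$$\frac{(\|\bm{b}\|_{q}-\sigma)\,m^{\min\{\frac{1}{2}-\frac{1}{q},\,0\}}}{|\mathcal{J}|\sqrt{\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}}\leq\|\bm{x}^{*}\|_{\infty}\leq\frac{\sigma\,m^{\max\{\frac{1}{2}-\frac{1}{q},\,0\}}+\|\bm{b}\|_{2}}{\sqrt{\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}},$$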

where $\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})$ and $\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})$ are the largest and smallest eigenvalues of $A_{\mathcal{J}}^{\top}A_{\mathcal{J}}$ , respectively. Moreover, when $\sigma=0$ , we have $\bm{x}_{\mathcal{J}}^{*}=(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})^{-1}A_{\mathcal{J}}^{\top}\bm{b}$ .

Proof. Statement (i). If $\|A\bm{x}^{*}-\bm{b}\|_{q}=\sigma$ , the results hold trivially. Next, we assume that $\|A\bm{x}^{*}-\bm{b}\|_{q}<\sigma$ .

Consider $0<p\leq 1$ . From $\|\bm{b}\|_{q}>\sigma$ , we see that $A\bm{x}^{*}\neq 0$ . Then, it is easy to verify that there exists a constant $0<c<1$ such that $\|A(c\bm{x}^{*})-\bm{b}\|_{q}<\sigma$ . Thus, $c\bm{x}^{*}\in{\rm FEA}(A,\bm{b},\sigma,q)$ , but $\|c\bm{x}^{*}\|^{p}_{p}=c^{p}\|\bm{x}^{*}\|_{p}^{p}<\|\bm{x}^{*}\|_{p}^{p}$ for $0<p\leq 1$ . This leads to a contradiction. Hence, we have $\|A\bm{x}^{*}-\bm{b}\|_{q}=\sigma$ .

Consider $p=0$ . Let $f(t):=\|A(t\bm{x}^{*})-\bm{b}\|_{q}$ . Then, from the continuity of $f$ , $f(0)=\|\bm{b}\|_{q}>\sigma$ and $f(1)=\|A\bm{x}^{*}-\bm{b}\|_{q}<\sigma$ , there exists a scalar $\alpha\in(0,1)$ such that $f(\alpha)=\|A(\alpha\bm{x}^{*})-\bm{b}\|_{q}=\sigma$ . Moreover, it is easy to verify that $f$ is convex on $[0,\,1]$ . Thus, for any $\alpha^{\prime}\in[\alpha,\,1]$ , there exists a $0\leq\lambda\leq 1$ such that $\alpha^{\prime}=\lambda\alpha+(1-\lambda)$ and $f(\alpha^{\prime})\leq\lambda f(\alpha)+(1-\lambda)f(1)\leq\sigma$ . Hence, $\alpha^{\prime}\bm{x}^{*}$ is feasible. This together with $\|\alpha^{\prime}\bm{x}^{*}\|_{0}=\|\bm{x}^{*}\|_{0}$ shows that $\alpha^{\prime}\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,0,q)$ for any $\alpha^{\prime}\in[\alpha,\,1]$ .
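The scalar $\alpha$ in the case $p=0$ can be located numerically. The following sketch uses bisection on $t\mapsto\|A(t\bm{x}^{*})-\bm{b}\|_{q}$, relying on the convexity noted above so that the feasible values of $t$ in $[0,1]$ form an interval ending at $1$; the function names and the toy data are assumptions of this example.

```python
import math

def residual_norm(A, x, b, q):
    """Return ||Ax - b||_q for 1 <= q <= inf."""
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    if q == math.inf:
        return max(abs(ri) for ri in r)
    return sum(abs(ri) ** q for ri in r) ** (1.0 / q)

def boundary_scale(A, x, b, sigma, q, iters=60):
    """Bisection for the smallest t in (0, 1] with ||A(t x) - b||_q <= sigma.

    Assumes ||b||_q > sigma (t = 0 is infeasible) and that x itself is feasible.
    """
    f = lambda t: residual_norm(A, [t * xi for xi in x], b, q)
    lo, hi = 0.0, 1.0   # invariant: f(lo) > sigma >= f(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > sigma:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical toy data: x solves Ax = b exactly, so it is strictly feasible.
A = [[1.0, 2.0]]
b = [2.0]
alpha = boundary_scale(A, [0.0, 1.0], b, sigma=0.5, q=1)
print(alpha)
```

Here $f(t)=|2t-2|$, so the boundary is reached at $\alpha=0.75$, and every $\alpha^{\prime}\bm{x}^{*}$ with $\alpha^{\prime}\in[\alpha,1]$ has the same support as $\bm{x}^{*}$.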

Statement (ii). Let $s:=\|\bm{x}^{*}\|_{0}=|\mathcal{J}|$ for simplicity. We then consider the following two cases.

Case 1, $p=0$ . First, it is not hard to see that $s\leq m$ since any set of $m+1$ vectors in $\mathbb{R}^{m}$ is linearly dependent. Thus, we have $\mathrm{rank}(A_{\mathcal{J}})\leq\min\{m,\,s\}=s$ . We next prove $\mathrm{rank}(A_{\mathcal{J}})=s$ by contradiction. Assume that $\mathrm{rank}(A_{\mathcal{J}})<s$ . Then, there exists a vector $\hat{\bm{h}}\in\mathbb{R}^{s}$ such that $\hat{\bm{h}}\neq 0$ and $A_{\mathcal{J}}\hat{\bm{h}}=0$ . Let $\bm{h}\in\mathbb{R}^{n}$ be a vector such that $\bm{h}_{\mathcal{J}}=\hat{\bm{h}}$ and $\bm{h}_{\mathcal{J}^{c}}=0$ . Thus, we have $A\bm{h}=0$ . Now, let

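$$\tau:=\min_{h_{i}\neq 0,\,i\in\mathcal{J}}\left\{\frac{x_{i}^{*}}{h_{i}}\right\}=\frac{x_{i_{0}}^{*}}{h_{i_{0}}}\quad\text{for some } i_{0}.$$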

Then, we see that $\tilde{\bm{x}}:=\bm{x}^{*}-\tau\bm{h}\in{\rm FEA}(A,\bm{b},\sigma,q)$ since $A\tilde{\bm{x}}=A(\bm{x}^{*}-\tau\bm{h})=A\bm{x}^{*}$ . Moreover, from the definition of $\tau$ , one can verify that $\tilde{x}_{i_{0}}=0$ and thus $\|\tilde{\bm{x}}\|_{0}<\|\bm{x}^{*}\|_{0}$ . This contradicts the optimality of $\bm{x}^{*}$ . Hence, we must have $\mathrm{rank}(A_{\mathcal{J}})=s=\|\bm{x}^{*}\|_{0}$ .
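The support-reduction step used in Case 1 can be sketched as follows: given a feasible point and a null-space direction supported on the same index set, moving by $\tau$ zeroes one entry without changing $A\bm{x}$. The function name and the toy data are assumptions of this example.

```python
def reduce_support(x, h):
    """Return x - tau*h, where tau is the smallest ratio x_i/h_i over i with h_i != 0.

    Assumes h is nonzero, vanishes outside the support of x, and satisfies A h = 0.
    """
    ratios = [(xi / hi, i) for i, (xi, hi) in enumerate(zip(x, h)) if xi != 0 and hi != 0]
    tau, i0 = min(ratios)
    x_new = [xi - tau * hi for xi, hi in zip(x, h)]
    x_new[i0] = 0.0   # exact zero, guarding against rounding in x_i0 - tau*h_i0
    return x_new

# Hypothetical toy data: A = [1, 2], so h = (2, -1) satisfies A h = 0.
A_row = [1.0, 2.0]
x = [0.5, 0.5]
h = [2.0, -1.0]
x_tilde = reduce_support(x, h)
print(x_tilde, sum(a * v for a, v in zip(A_row, x_tilde)))
```

In this instance $\tau=-0.5$, the second entry is eliminated, and $A\tilde{\bm{x}}=A\bm{x}=1.5$, so feasibility is preserved while the number of nonzeros drops from two to one.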

Case 2, $0<p<1$ . We first prove $s\leq m$ by contradiction. Assume that $s>m$ . Thus, there exists a vector $\tilde{\bm{h}}\in\mathbb{R}^{s}$ such that $\tilde{\bm{h}}\neq 0$ and $A_{\mathcal{J}}\tilde{\bm{h}}=0$ , since $\mathrm{rank}(A_{\mathcal{J}})\leq\min\{m,\,s\}=m<s$ . Let $\bm{h}\in\mathbb{R}^{n}$ be a vector such that $\bm{h}_{\mathcal{J}}=\tilde{\bm{h}}$ and $\bm{h}_{\mathcal{J}^{c}}=0$ . Thus, we have that $A\bm{h}=0$ and hence $\bm{x}^{*}+t\bm{h}\in{\rm FEA}(A,\bm{b},\sigma,q)$ for any $t\in\mathbb{R}$ . Moreover, we can choose a sufficiently small real positive number $t_{0}>0$ such that, for all $|t|\leq t_{0}$ ,

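$$\bm{x}_{\mathcal{J}}^{*}+t\bm{h}_{\mathcal{J}}\neq 0,\quad\text{and}\quad\mathrm{sgn}(x_{i}^{*})=\mathrm{sgn}(x_{i}^{*}+th_{i})\ \text{ for } i\in\mathcal{J}. \tag{6}$$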

Let $f(t):=\sum_{i\in\mathcal{J}}\left[\mathrm{sgn}(x_{i}^{*})(x^{*}_{i}+th_{i})\right]^{p}$ . Then, we have

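$$\begin{aligned}f(0)&=\sum_{i\in\mathcal{J}}|x_{i}^{*}|^{p}=\|\bm{x}^{*}\|_{p}^{p}=\min_{t\in[-t_{0},t_{0}]}\|\bm{x}^{*}+t\bm{h}\|_{p}^{p}\\ &=\min_{t\in[-t_{0},t_{0}]}\sum_{i\in\mathcal{J}}\left[\mathrm{sgn}(x_{i}^{*}+th_{i})(x_{i}^{*}+th_{i})\right]^{p}=\min_{t\in[-t_{0},t_{0}]}f(t),\end{aligned}$$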

where the third equality follows because $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,p,q)$ and the last equality follows from (6). However, for all $|t|\leq t_{0}$ ,

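$$f^{\prime\prime}(t)=p(p-1)\sum_{i\in\mathcal{J}}\left[\mathrm{sgn}(x_{i}^{*})(x_{i}^{*}+th_{i})\right]^{p-2}h_{i}^{2}<0.$$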

This leads to a contradiction, because a strictly concave function cannot attain its minimum over $[-t_{0},\,t_{0}]$ at the interior point $t=0$ . Hence, we have $s\leq m$ and $\mathrm{rank}(A_{\mathcal{J}})\leq\min\{m,\,s\}=s$ . We further assume that $\mathrm{rank}(A_{\mathcal{J}})<s$ . Then, there also exists a vector $\hat{\bm{h}}\in\mathbb{R}^{s}$ such that $\hat{\bm{h}}\neq 0$ and $A_{\mathcal{J}}\hat{\bm{h}}=0$ . Repeating the concavity argument above with $\hat{\bm{h}}$ in place of $\tilde{\bm{h}}$ , we again get a contradiction. Hence, we must have $\mathrm{rank}(A_{\mathcal{J}})=s$ .
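The concavity argument in Case 2 can be illustrated numerically. In the sketch below, the point, the direction and the step $t_{0}$ are hypothetical values chosen so that no entry changes sign on $[-t_{0},t_{0}]$; it shows that $t=0$ cannot minimize $f$ along a null-space direction.

```python
def f(t, x, h, p):
    """f(t) = sum of |x_i + t*h_i|^p over the support of x."""
    return sum(abs(xi + t * hi) ** p for xi, hi in zip(x, h) if xi != 0)

x = [0.5, 0.5]     # hypothetical point with two nonzero entries
h = [2.0, -1.0]    # direction with A h = 0 for the hypothetical row A = [1, 2]
p, t0 = 0.5, 0.1   # t0 is small enough that x + t*h keeps the signs of x

center = f(0.0, x, h, p)
left, right = f(-t0, x, h, p), f(t0, x, h, p)
print(center, left, right)
```

Strict concavity forces the value at the midpoint to exceed the average of the two endpoint values, so at least one endpoint gives a strictly smaller objective while remaining feasible.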

Statement (iii). From statement (ii), $A_{\mathcal{J}}$ has full column rank and hence $\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})\neq 0$ . Then, we see that

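$$\begin{aligned}\sigma&\geq\|A\bm{x}^{*}-\bm{b}\|_{q}=\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}-\bm{b}\|_{q}\geq m^{\min\{\frac{1}{q}-\frac{1}{2},\,0\}}\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}-\bm{b}\|_{2}\\ &\geq m^{\min\{\frac{1}{q}-\frac{1}{2},\,0\}}\left(\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{2}-\|\bm{b}\|_{2}\right)\geq m^{\min\{\frac{1}{q}-\frac{1}{2},\,0\}}\left(\sqrt{\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}\,\|\bm{x}_{\mathcal{J}}^{*}\|_{2}-\|\bm{b}\|_{2}\right),\end{aligned}$$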

where the second inequality follows from Lemma 2.1 and the last inequality follows from $\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{2}^{2}\geq\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})\|\bm{x}_{\mathcal{J}}^{*}\|_{2}^{2}$ . Thus, the above relation implies that

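$$\|\bm{x}^{*}\|_{\infty}\leq\|\bm{x}^{*}\|_{2}=\|\bm{x}_{\mathcal{J}}^{*}\|_{2}\leq\frac{\sigma\,m^{\max\{\frac{1}{2}-\frac{1}{q},\,0\}}+\|\bm{b}\|_{2}}{\sqrt{\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}},$$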

which gives the upper bound for $\|\bm{x}^{*}\|_{\infty}$ . On the other hand, we have

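$$\begin{aligned}\sigma&\geq\|A\bm{x}^{*}-\bm{b}\|_{q}=\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}-\bm{b}\|_{q}\geq\|\bm{b}\|_{q}-\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{q}\\ &\geq\|\bm{b}\|_{q}-m^{\max\{\frac{1}{q}-\frac{1}{2},\,0\}}\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{2}\geq\|\bm{b}\|_{q}-m^{\max\{\frac{1}{q}-\frac{1}{2},\,0\}}\sqrt{\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}\,\|\bm{x}_{\mathcal{J}}^{*}\|_{2},\end{aligned}$$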

where the third inequality follows from Lemma 2.1 and the last inequality follows from $\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{2}^{2}\leq\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})\|\bm{x}_{\mathcal{J}}^{*}\|_{2}^{2}$ . This results in

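$$\|\bm{x}^{*}\|_{\infty}=\|\bm{x}_{\mathcal{J}}^{*}\|_{\infty}\geq\frac{\|\bm{x}_{\mathcal{J}}^{*}\|_{2}}{|\mathcal{J}|}\geq\frac{(\|\bm{b}\|_{q}-\sigma)\,m^{\min\{\frac{1}{2}-\frac{1}{q},\,0\}}}{|\mathcal{J}|\sqrt{\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})}},$$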

which gives the lower bound for $\|\bm{x}^{*}\|_{\infty}$ . Recall that $\|\bm{b}\|_{q}>\sigma$ (our blanket assumption). Thus, this lower bound is nontrivial. Moreover, when $\sigma=0$ , we have $A\bm{x}^{*}=A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}=\bm{b}$ and hence $\bm{x}_{\mathcal{J}}^{*}=(A_{\mathcal{J}}^{\top}A_{\mathcal{J}})^{-1}A_{\mathcal{J}}^{\top}\bm{b}$ . We then complete the proof.
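As a sanity check of statement (iii), the sketch below finds a minimizer of a hypothetical $1\times 2$ instance by brute-force grid search and compares its infinity norm with the two bounds. The bounds are evaluated with $\sqrt{\lambda_{\min}}$ and $\sqrt{\lambda_{\max}}$, as dictated by the inequalities $\lambda_{\min}\|\bm{x}_{\mathcal{J}}^{*}\|_{2}^{2}\leq\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}^{*}\|_{2}^{2}\leq\lambda_{\max}\|\bm{x}_{\mathcal{J}}^{*}\|_{2}^{2}$ used in the proof. The instance, the grid and the variable names are assumptions of this example; grid search is for illustration only.

```python
import math

# Hypothetical instance with m = 1, n = 2: A = [1, 2], b = 2, sigma = 0.5, p = 0.5.
A_row, b, sigma, p = (1.0, 2.0), 2.0, 0.5, 0.5

def objective(x):
    return sum(abs(xi) ** p for xi in x)

def feasible(x, tol=1e-12):
    return abs(A_row[0] * x[0] + A_row[1] * x[1] - b) <= sigma + tol

# Brute-force search on a grid with step 0.05 over [-3, 3]^2.
grid = [k / 20 for k in range(-60, 61)]
x_star = min(((u, v) for u in grid for v in grid if feasible((u, v))), key=objective)

J = [i for i, xi in enumerate(x_star) if xi != 0]
# With a single support column a, A_J^T A_J is the scalar a^2.
lam = A_row[J[0]] ** 2
x_inf = max(abs(xi) for xi in x_star)
# m = 1, so every power of m in the bounds equals 1 and all q-norms of b coincide.
lower = (abs(b) - sigma) / (len(J) * math.sqrt(lam))
upper = (sigma + abs(b)) / math.sqrt(lam)
print(x_star, lower, x_inf, upper)
```

In this instance the support has one element, the constraint is active at the minimizer, and the lower bound is attained. Dropping the square roots would give an upper bound of $2.5/4=0.625$, which the minimizer violates.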

Remark 2.3 (The sparse solution of the $L_{p}$ - $L_{2}$ problem)

Theorem 2.2(ii) implies that without any condition on the sensing matrix $A$ , $\|\bm{x}^{*}\|_{0}\leq\min(m,n)$ for any $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,p,q)$ with $0<p<1$ and $1\leq q\leq\infty$ , while Shen and Mousavi show in [34, Proposition 3.1] that $\|\bm{x}^{*}\|_{0}\geq n-m+1$ for any $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,p,2)$ with $p>1$ and $n\geq m$ if every $m\times m$ submatrix of $A$ is invertible. Combining these results gives a formal confirmation that if $n\gg m$ , all solutions of the $L_{p}$ - $L_{2}$ problem with $0\leq p<1$ are sparse, but the $L_{p}$ - $L_{2}$ problem with $p>1$ may not have sparse solutions.

In the following, we shall derive more theoretical results for the optimal solution set of the $L_{1}$ -constrained problem (4) with $0\leq p<1$ . But we should point out that all results established later can be extended without much difficulty to the $L_{\infty}$ -constrained case or other more general cases; see Remarks 2.6 and 2.10 for more details. As we shall see later, solving problem (4) with a sufficiently small $0<p<1$ actually gives an optimal solution of problem (4) with $p=0$ . This result is based on the simple observation that the feasible set ${\rm FEA}(A,\bm{b},\sigma,1)$ is a convex polyhedron in $\mathbb{R}^{n}$ (see Lemma 8.1). Moreover, observe that $\mathbb{R}^{n}$ can be represented as a union of $2^{n}$ orthants, denoted by $\mathbb{P}_{j}$ for $j=1,\cdots,2^{n}$ , such that any two vectors $\bm{x}$ and $\bm{y}$ in each $\mathbb{P}_{j}$ have the same sign for each entry, i.e., for each $\mathbb{P}_{j}$ , we have

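$$\forall\,\bm{x},\,\bm{y}\in\mathbb{P}_{j}\ \Longrightarrow\ x_{i}y_{i}\geq 0\quad\text{for } i=1,\cdots,n. \tag{7}$$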

For example, when $n=2$ , we have $\mathbb{R}^{2}=\bigcup^{4}_{j=1}\mathbb{P}_{j}$ , where $\mathbb{P}_{1}=\{\bm{x}:x_{1}\geq 0,x_{2}\geq 0\}$ , $\mathbb{P}_{2}=\{\bm{x}:x_{1}\geq 0,x_{2}\leq 0\}$ , $\mathbb{P}_{3}=\{\bm{x}:x_{1}\leq 0,x_{2}\geq 0\}$ and $\mathbb{P}_{4}=\{\bm{x}:x_{1}\leq 0,x_{2}\leq 0\}$ . Then, for each $j$ , one can see that $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ is empty or a polyhedron that has a finite number of extreme points because $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ contains no lines; see [32, Corollary 18.5.3] and [32, Corollary 19.1.1].
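The decomposition of $\mathbb{R}^{n}$ into closed orthants can be enumerated by sign patterns. The following is a minimal sketch; the encoding of an orthant as a tuple of signs is an assumption of this example rather than notation from the paper.

```python
from itertools import product

def orthants(n):
    """Return the 2^n sign patterns; s encodes the orthant {x : s_i * x_i >= 0 for all i}."""
    return list(product((1, -1), repeat=n))

def in_orthant(x, s):
    """True if x lies in the closed orthant encoded by the sign pattern s."""
    return all(si * xi >= 0 for si, xi in zip(s, x))

for s in orthants(2):
    print(s, in_orthant([0.0, -2.0], s))
```

For $n=2$ the patterns appear in the same order as $\mathbb{P}_{1},\ldots,\mathbb{P}_{4}$ in the example above. Because the orthants are closed, a point with a zero entry belongs to more than one of them.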

Lemma 2.4

Let $0<p<1$ . Suppose that $j\in\{1,\cdots,2^{n}\}$ is an arbitrary index such that $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)\neq\emptyset$ , where $\mathbb{P}_{j}$ is defined in (7). Then, any optimal solution of the following problem

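$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \|\bm{x}\|_{p}^{p}\quad\mbox{s.t.}\quad\bm{x}\in\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)\tag{8}$$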

is an extreme point of $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ .

Proof. Let $\bm{x}^{*}$ be an optimal solution of (8). Suppose that there exist $\bm{y},\,\bm{z}\in\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ such that $\bm{x}^{*}=\lambda\bm{y}+(1-\lambda)\bm{z}$ for some $0<\lambda<1$ . Then, we have

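$$\begin{aligned}\|\bm{x}^{*}\|_{p}^{p}&=\sum_{i=1}^{n}|x^{*}_{i}|^{p}=\sum_{i=1}^{n}|\lambda y_{i}+(1-\lambda)z_{i}|^{p}=\sum_{i=1}^{n}\big(\lambda|y_{i}|+(1-\lambda)|z_{i}|\big)^{p}\\&\geq\sum_{i=1}^{n}\big(\lambda|y_{i}|^{p}+(1-\lambda)|z_{i}|^{p}\big)=\lambda\|\bm{y}\|_{p}^{p}+(1-\lambda)\|\bm{z}\|_{p}^{p}\geq\|\bm{x}^{*}\|_{p}^{p},\end{aligned}$$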

where the third equality follows because any $\bm{y},\,\bm{z}\in\mathbb{P}_{j}$ have the same sign for each entry, the first inequality follows because $f(t)=t^{p}$ is strictly concave for $t\geq 0$ , and the last inequality follows because $\bm{x}^{*}$ is an optimal solution of (8). Note that the above relation holds if and only if $\bm{y}=\bm{z}=\bm{x}^{*}$ . This implies that $\bm{x}^{*}$ is an extreme point of $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ .
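The concavity step in this proof can be checked numerically. The sketch below, with hypothetical helper names `lp_p` and `concavity_gap`, evaluates $\|\lambda\bm{y}+(1-\lambda)\bm{z}\|_{p}^{p}-\big(\lambda\|\bm{y}\|_{p}^{p}+(1-\lambda)\|\bm{z}\|_{p}^{p}\big)$ for two points of the nonnegative orthant. It is only an illustration of the inequality, not a substitute for the argument.

```python
def lp_p(x, p):
    """Return ||x||_p^p = sum_i |x_i|^p for 0 < p <= 1."""
    return sum(abs(t) ** p for t in x)


def concavity_gap(y, z, lam, p):
    """Value at the convex combination minus the combination of the values.

    For y and z in the same orthant and 0 < p < 1 this is nonnegative,
    and it vanishes when y == z.
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(y, z)]
    return lp_p(x, p) - (lam * lp_p(y, p) + (1 - lam) * lp_p(z, p))


# Two distinct points of the nonnegative orthant: the gap is strictly positive,
# so their midpoint cannot minimize ||x||_p^p over a set containing both.
print(concavity_gap([4.0, 0.0], [0.0, 4.0], 0.5, 0.5))

# Identical points: the gap is zero.
print(concavity_gap([1.0, 2.0], [1.0, 2.0], 0.5, 0.5))
```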

Based on Lemma 2.4, we are able to characterize the number of the optimal solutions of problem (4) with $0<p<1$ . For notational simplicity, for $j=1,\cdots,2^{n}$ , let

[TABLE]

Proposition 2.5

For any $0<p<1$ , the optimal solution set ${\rm SOL}(A,\bm{b},\sigma,p,1)$ of problem (4) is a finite set. Moreover, the set $\bigcup_{0<p<1}{\rm SOL}(A,\bm{b},\sigma,p,1)$ is a finite set.

Proof. For a given $0<p<1$ , let $\bm{x}^{*}$ be an optimal solution of problem (4), i.e., $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,p,1)$ . Then, there must exist a $j^{*}\in\{1,\cdots,2^{n}\}$ such that $\bm{x}^{*}\in\mathbb{P}_{j^{*}}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ and $\bm{x}^{*}$ is also an optimal solution of (8) with $j^{*}$ in place of $j$ . Then, it follows from Lemma 2.4 that $\bm{x}^{*}$ is an extreme point of $\mathbb{P}_{j^{*}}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ . This implies that

[TABLE]

Note that, for each $j$ , $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ is empty or a polyhedron that has a finite number of extreme points since $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ contains no lines; see [32, Corollary 18.5.3] and [32, Corollary 19.1.1]. This together with (9) implies that ${\rm SOL}(A,\bm{b},\sigma,p,1)$ is a finite set.

Moreover, since (9) holds for any $0<p<1$ , then we have

[TABLE]

which implies $\bigcup_{0<p<1}{\rm SOL}(A,\bm{b},\sigma,p,1)$ is a finite set. This completes the proof.

Remark 2.6 (Comments on Proposition 2.5)

Proposition 2.5 is obtained based on the observation that the feasible set ${\rm FEA}(A,\bm{b},\sigma,1)$ is a convex polyhedron in $\mathbb{R}^{n}$ . From this observation, we can extend Proposition 2.5 to that for any $0<p<1$ , the optimal solution set ${\rm SOL}(A,\bm{b},\sigma,p,q)$ of (1) with $q=\infty$ is a finite set. However, it is not clear whether for any $0<p<1$ , the optimal solution set ${\rm SOL}(A,\bm{b},\sigma,p,q)$ of problem (1) with $q=2$ is a finite set. Thanks to Theorem 2.2, we can claim that if $A$ satisfies $\mathrm{rank}(A)=2$ , the optimal solution set ${\rm SOL}(A,\bm{b},\sigma,\frac{1}{k},2)$ is a finite set, where $k\geq 2$ is a positive integer. Indeed, in this case, by Theorem 2.2(ii), any optimal solution $\bm{x}^{*}$ satisfies that $\|\bm{x}^{*}\|_{0}=|\mathcal{J}|=\mathrm{rank}(A_{\mathcal{J}})\leq\mathrm{rank}(A)=2$ and hence has at most two nonzero entries supported on $\mathcal{J}$ . Then, there are only $\frac{n(n-1)}{2}$ different choices of the support set $\mathcal{J}$ . Let $\nu^{*}$ be the optimal objective value and, without loss of generality, assume that $x^{*}_{1}\geq 0$ , $x^{*}_{2}\geq 0$ , $x^{*}_{3}=\cdots=x^{*}_{n}=0$ . Then, $\sqrt[k]{x^{*}_{1}}+\sqrt[k]{x^{*}_{2}}=\nu^{*}$ . Also, let $t:=\sqrt[k]{x^{*}_{1}}$ and $\sqrt[k]{x^{*}_{2}}=\nu^{*}-t$ . We then see from Theorem 2.2(i) that $\|A_{\mathcal{J}}\bm{x}^{*}_{\mathcal{J}}-\bm{b}\|^{2}_{2}=\|A\bm{x}^{*}-\bm{b}\|^{2}_{2}=\sigma^{2}$ and this equation can be further written as a $2k$ -th order polynomial equation $f(t)=0$ , which has at most $2k$ real roots. This implies that, for each $\mathcal{J}$ satisfying $|\mathcal{J}|=2$ , there are only $2k$ different choices of $x^{*}_{1}$ and $x^{*}_{2}$ . Hence, the optimal solution set ${\rm SOL}(A,\bm{b},\sigma,\frac{1}{k},2)$ is a finite set and the number of solutions is at most $n(n-1)k$ .

We next give two supporting lemmas and relegate the proofs to Appendices 6 and 7, respectively.

Lemma 2.7

Suppose that $\bm{a}=(a_{1},\cdots,a_{n})^{\top}\in\mathbb{R}^{n}$ and $\bm{b}=(b_{1},\cdots,b_{n})^{\top}\in\mathbb{R}^{n}$ satisfy

[TABLE]

then $\bm{a}=\bm{b}$ .

Lemma 2.8

Given $\bm{a}$ , $\bm{b}\in\mathbb{R}^{n}$ with $\|\bm{a}\|_{0}=\|\bm{b}\|_{0}=s$ . Let $\{a_{i_{1}},\cdots,a_{i_{s}}\}$ and $\{b_{t_{1}},\cdots,b_{t_{s}}\}$ be the nonzero entries in $\bm{a}$ and $\bm{b}$ , respectively, and, without loss of generality, assume that $|a_{i_{1}}|\leq\cdots\leq|a_{i_{s}}|$ and $|b_{t_{1}}|\leq\cdots\leq|b_{t_{s}}|$ . For $k=1,\cdots,s$ , define

[TABLE]

Then, the following statements hold.

(i)

If $\Delta_{k}(\bm{a},\bm{b})=0$ for all $k=1,\cdots,s$ , then $\|\bm{a}\|_{p}^{p}=\|\bm{b}\|_{p}^{p}$ holds for any $p>0$ .

(ii)

Otherwise, there exists a sufficiently small $p^{\prime}$ such that either $\|\bm{a}\|_{p}^{p}<\|\bm{b}\|_{p}^{p}$ or $\|\bm{a}\|_{p}^{p}>\|\bm{b}\|_{p}^{p}$ holds for any $p\in(0,\,p^{\prime}]$ .

Now, we are ready to present our results concerning the optimal solution set ${\rm SOL}(A,\bm{b},\sigma,p,1)$ with different choices of $p$ .

Theorem 2.9

There exists a $p^{*}\in(0,\,1]$ such that ${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq{\rm SOL}(A,\bm{b},\sigma,0,1)$ for any $p\in(0,\,p^{*})$ . Moreover, there exists a $\overline{p}\in(0,\,p^{*})$ such that ${\rm SOL}(A,\bm{b},\sigma,p,1)={\rm SOL}(A,\bm{b},\sigma,\overline{p},1)$ for any $p\in(0,\,\overline{p}]$ .

Proof. We prove the first result by contradiction. Assume that there does not exist a number $p^{*}\in(0,\,1]$ such that, for any $p\in(0,\,p^{*})$ , ${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq{\rm SOL}(A,\bm{b},\sigma,0,1)$ . Consider a sequence $\{p_{k}\}$ with $0<p_{k}<1$ and $p_{k}\to 0$ as $k\to\infty$ . Thus, from the hypothesis, for each $p_{k}$ , there exists a point $\bm{x}^{k}$ such that $\bm{x}^{k}\in{\rm SOL}(A,\bm{b},\sigma,p_{k},1)$ and $\bm{x}^{k}\notin{\rm SOL}(A,\bm{b},\sigma,0,1)$ . Now, we consider the sequence $\{\bm{x}^{k}\}$ . Note that all elements in $\{\bm{x}^{k}\}$ come from the set $\bigcup_{0<p<1}{\rm SOL}(A,\bm{b},\sigma,p,1)$ but they are not contained in ${\rm SOL}(A,\bm{b},\sigma,0,1)$ . Since there are only finitely many points in $\bigcup_{0<p<1}{\rm SOL}(A,\bm{b},\sigma,p,1)$ (by Proposition 2.5), then there exists at least one point $\hat{\bm{x}}\in\bigcup_{0<p<1}{\rm SOL}(A,\bm{b},\sigma,p,1)$ such that $\{\bm{x}^{k}\}$ contains infinitely many $\hat{\bm{x}}$ , i.e., there exists a subsequence $\{\bm{x}^{k_{j}}\}$ so that $\bm{x}^{k_{j}}\equiv\hat{\bm{x}}$ for all $k_{j}$ . Moreover, let $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,0,1)$ . Then, for all $k_{j}$ , we have $\|\bm{x}^{k_{j}}\|_{p_{k_{j}}}^{p_{k_{j}}}\leq\|\bm{x}^{*}\|_{p_{k_{j}}}^{p_{k_{j}}}$ since $\bm{x}^{k_{j}}\in{\rm SOL}(A,\bm{b},\sigma,p_{k_{j}},1)$ . Then, we see that

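$$\|\hat{\bm{x}}\|_{0}=\lim_{j\to\infty}\|\hat{\bm{x}}\|_{p_{k_{j}}}^{p_{k_{j}}}=\lim_{j\to\infty}\|\bm{x}^{k_{j}}\|_{p_{k_{j}}}^{p_{k_{j}}}\leq\lim_{j\to\infty}\|\bm{x}^{*}\|_{p_{k_{j}}}^{p_{k_{j}}}=\|\bm{x}^{*}\|_{0},$$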

which implies that $\hat{\bm{x}}\in{\rm SOL}(A,\bm{b},\sigma,0,1)$ . This leads to a contradiction and completes the proof for the first result.
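The limiting step used above relies on the elementary fact that $\|\bm{x}\|_{p}^{p}\to\|\bm{x}\|_{0}$ as $p\downarrow 0$ for any fixed $\bm{x}$. A minimal numerical sketch follows; the vectors are arbitrary test data chosen for illustration, and `lp_p` and `l0` are hypothetical helper names.

```python
def lp_p(x, p):
    """Return ||x||_p^p, summing |x_i|^p over the nonzero entries only."""
    return sum(abs(t) ** p for t in x if t != 0)


def l0(x):
    """Return the number of nonzero entries of x."""
    return sum(1 for t in x if t != 0)


x = [2.5, 0.0, -0.01, 40.0]
for p in (0.5, 0.1, 0.01, 0.001):
    # The values approach l0(x) = 3 as p decreases, whatever the magnitudes are.
    print(p, lp_p(x, p))

# A vector with more nonzero entries can win for p near 1 but must lose
# once p is small enough, because the cardinality dominates in the limit.
dense_small = [0.1, 0.1]
sparse_large = [100.0, 0.0]
print(lp_p(dense_small, 1.0) < lp_p(sparse_large, 1.0))
print(lp_p(dense_small, 0.01) > lp_p(sparse_large, 0.01))
```

The second comparison mirrors the contradiction argument: a point that is optimal for infinitely many $p_{k}\to 0$ cannot have more nonzero entries than a sparsest feasible point.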

Next, we prove the second result. For notational simplicity, let $\mathcal{S}_{0\sim p^{*}}:=\bigcup_{0<p<p^{*}}{\rm SOL}(A,\bm{b},\sigma,p,1)$ and $s:=\|\bm{x}^{*}\|_{0}$ , where $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,0,1)$ . For any $\bm{x}\in\mathcal{S}_{0\sim p^{*}}$ , we have $\|\bm{x}\|_{0}=s$ (by the first result) and define a set as $\mathcal{C}(\bm{x}):=\big{\{}\bm{z}\in\mathcal{S}_{0\sim p^{*}}:\Delta_{k}(\bm{x},\bm{z})=0,\,\forall\,k=1,\cdots,s\big{\}}$ , where $\Delta_{k}(\cdot,\cdot)$ is defined as (10). Then, given $\bm{x}\in\mathcal{S}_{0\sim p^{*}}$ and $\bm{y}\in\mathcal{S}_{0\sim p^{*}}\setminus\mathcal{C}(\bm{x})$ , it follows from Lemma 2.8(ii) that there exists a sufficiently small $p^{(x,y)}\in(0,\,p^{*})$ such that either $\|\bm{x}\|_{p}^{p}<\|\bm{y}\|_{p}^{p}$ or $\|\bm{x}\|_{p}^{p}>\|\bm{y}\|_{p}^{p}$ holds for any $p\in(0,\,p^{(x,y)}]$ . Since $\mathcal{S}_{0\sim p^{*}}$ is contained in $\bigcup_{0<p<1}{\rm SOL}(A,\bm{b},\sigma,p,1)$ , then the number of such a pair $(\bm{x},\,\bm{y})$ is finite. Therefore, we must have a sufficiently small $\tilde{p}\in(0,\,p^{*})$ such that, for any $\bm{x}\in\mathcal{S}_{0\sim p^{*}}$ and $\bm{y}\in\mathcal{S}_{0\sim p^{*}}\setminus\mathcal{C}(\bm{x})$ , either $\|\bm{x}\|_{p}^{p}<\|\bm{y}\|_{p}^{p}$ or $\|\bm{x}\|_{p}^{p}>\|\bm{y}\|_{p}^{p}$ holds for any $p\in(0,\,\tilde{p}]$ . Now, for such $\tilde{p}$ , consider any $p^{\prime}\in(0,\,\tilde{p}]$ and let $\bm{x}^{\prime}\in{\rm SOL}(A,\bm{b},\sigma,p^{\prime},1)$ . We must have $\|\bm{x}^{\prime}\|_{p^{\prime}}^{p^{\prime}}<\|\bm{y}\|_{p^{\prime}}^{p^{\prime}}$ for any $\bm{y}\in\mathcal{S}_{0\sim p^{*}}\setminus\mathcal{C}(\bm{x}^{\prime})$ . This together with Lemma 2.8(ii) implies that for any $0<p<p^{\prime}$ , $\|\bm{x}^{\prime}\|_{p}^{p}<\|\bm{y}\|_{p}^{p}$ for any $\bm{y}\in\mathcal{S}_{0\sim p^{*}}\setminus\mathcal{C}(\bm{x}^{\prime})$ . 
Moreover, from Lemma 2.8(i), for any $p>0$ , $\|\bm{x}^{\prime}\|_{p}^{p}=\|\bm{y}\|_{p}^{p}$ for any $\bm{y}\in\mathcal{C}(\bm{x}^{\prime})$ . These two facts show that for any $0<p<p^{\prime}$ , $\|\bm{x}^{\prime}\|_{p}^{p}\leq\|\bm{y}\|_{p}^{p}$ for any $\bm{y}\in\mathcal{S}_{0\sim p^{*}}$ . Hence, we have $\bm{x}^{\prime}\in{\rm SOL}(A,\bm{b},\sigma,p,1)$ for any $0<p<p^{\prime}\leq\tilde{p}$ . Since $p^{\prime}$ is arbitrary and $\bm{x}^{\prime}\in{\rm SOL}(A,\bm{b},\sigma,p^{\prime},1)$ is also arbitrary, we can conclude that ${\rm SOL}(A,\bm{b},\sigma,p^{\prime},1)\subseteq{\rm SOL}(A,\bm{b},\sigma,p^{\prime\prime},1)$ for any $0<p^{\prime\prime}<p^{\prime}\leq\tilde{p}$ .

We now prove by contradiction that there must exist a $\overline{p}\in(0,\,\tilde{p}]$ such that ${\rm SOL}(A,\bm{b},\sigma,p,1)={\rm SOL}(A,\bm{b},\sigma,\overline{p},1)$ for any $p\in(0,\,\overline{p}]$ . Assume this is not true. Then, for any $p^{\prime}\in(0,\,\tilde{p}]$ , there exists a $p^{\prime\prime}\in(0,\,p^{\prime})$ such that ${\rm SOL}(A,\bm{b},\sigma,p^{\prime\prime},1)\neq{\rm SOL}(A,\bm{b},\sigma,p^{\prime},1)$ . This together with the conclusion obtained above implies that ${\rm SOL}(A,\bm{b},\sigma,p^{\prime},1)$ must be strictly contained in ${\rm SOL}(A,\bm{b},\sigma,p^{\prime\prime},1)$ , i.e., ${\rm SOL}(A,\bm{b},\sigma,p^{\prime},1)\subset{\rm SOL}(A,\bm{b},\sigma,p^{\prime\prime},1)$ . With this fact, we generate a sequence $\{p_{k}\}$ as follows. Let $p_{0}=\tilde{p}$ . Then, there exists a $p_{1}\in(0,\,p_{0})$ such that ${\rm SOL}(A,\bm{b},\sigma,p_{0},1)\subset{\rm SOL}(A,\bm{b},\sigma,p_{1},1)$ . For such $p_{1}$ , there exists a $p_{2}\in(0,\,p_{1})$ such that ${\rm SOL}(A,\bm{b},\sigma,p_{1},1)\subset{\rm SOL}(A,\bm{b},\sigma,p_{2},1)$ . Repeating this procedure, we obtain a sequence $\{p_{k}\}$ such that $p_{0}>p_{1}>\cdots>0$ and ${\rm SOL}(A,\bm{b},\sigma,p_{0},1)\subset{\rm SOL}(A,\bm{b},\sigma,p_{1},1)\subset\cdots\subset{\rm SOL}(A,\bm{b},\sigma,0,1)$ . Thus, along such a sequence $\{p_{k}\}$ , the number of elements of ${\rm SOL}(A,\bm{b},\sigma,p_{k},1)$ strictly increases and hence $\bigcup_{k\geq 0}{\rm SOL}(A,\bm{b},\sigma,p_{k},1)$ must have infinitely many elements. This contradicts Proposition 2.5 and completes the proof.

Remark 2.10 (Comments on Theorem 2.9)

Theorem 2.9 is established based on the observation that the feasible set ${\rm FEA}(A,\bm{b},\sigma,1)$ of problem (4) is a polyhedron, and then, for each $j$ , $\mathbb{P}_{j}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ has at most a finite number of extreme points. Thus, one can also consider minimizing $\|\bm{x}\|_{0}$ under many other polyhedral constraints, for example, $\left\{\bm{x}\in\mathbb{R}^{n}:\|A\bm{x}-\bm{b}\|_{\infty}\leq\sigma\right\}$ and $\{\bm{x}\in\mathbb{R}^{n}:\|A\bm{x}-\bm{b}\|_{1}\leq\sigma,\,\bm{l}\leq\bm{x}\leq\bm{u}\}$ with $\bm{l}\in\mathbb{R}^{n}\cup\{-\infty\}^{n}$ , $\bm{u}\in\mathbb{R}^{n}\cup\{\infty\}^{n}$ and $\bm{l}<\bm{u}$ , to fit different scenarios in practice. Following arguments similar to those presented in this paper, one can obtain results analogous to Theorem 2.9 and Theorem 2.12 under these polyhedral constraints. Moreover, it is also possible to extend our smoothing penalty method presented in the next section to solve problems in these cases. We omit the details to avoid overcomplicating the presentation. In addition, we are aware that the first result in Theorem 2.9 has also been discussed in [40]. However, the analysis there is much more tedious.

Based on Theorem 2.9, it is easy to give the following corollary for $\sigma=0$ (namely, the noiseless case), which has also been discussed in [31, Theorem 1].

Corollary 2.11

There exists a $p^{*}\in(0,\,1]$ such that, for any $p\in(0,\,p^{*})$ , every optimal solution of problem $\min\big{\{}\|\bm{x}\|_{p}^{p}:A\bm{x}=\bm{b}\big{\}}$ is an optimal solution of problem $\min\big{\{}\|\bm{x}\|_{0}:A\bm{x}=\bm{b}\big{\}}$ .

Theorem 2.9 says that there exists a $p^{*}\in(0,\,1]$ such that solving problem (4) with any $p\in(0,\,p^{*})$ also solves problem (4) with $p=0$ . The constant $p^{*}$ is therefore central to this relation, and we estimate it in the next theorem. Our analysis is motivated by that of [31, Theorem 1], but makes use of results developed in Theorem 2.2 and Lemma 2.4 for the more general feasible set. Before proceeding, we define two constants as follows:

[TABLE]

Note that for any subset $\mathcal{I}\subseteq\{1,\cdots,n\}$ such that $A_{\mathcal{I}}$ has full column rank, $A_{\mathcal{I}}^{\top}A_{\mathcal{I}}$ is a principal submatrix of $A^{\top}A$ . Then, it follows from [27, Theorem 1.4.10] that $\lambda_{\min}(A_{\mathcal{I}}^{\top}A_{\mathcal{I}})>0$ is an eigenvalue of $A^{\top}A$ and hence $\lambda_{\min}(A_{\mathcal{I}}^{\top}A_{\mathcal{I}})\geq\lambda^{*}$ . This together with Theorem 2.2(iii) implies that

[TABLE]

From (12), (13) and Lemma 2.4, one can also see that $r\geq\tilde{r}$ .

Theorem 2.12

Let $s$ be the optimal objective value of problem (4) with $p=0$ and

[TABLE]

Then, for any $p\in(0,\,p^{*})$ , ${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq{\rm SOL}(A,\bm{b},\sigma,0,1)$ .

Proof. First, we show that

$$\left(\frac{r}{\tilde{r}}\right)^{p}<\frac{s+1}{s}\tag{15}$$

holds for any $p\in(0,\,p^{*})$ . Since $\frac{r}{\tilde{r}}\geq 1$ and (15) holds trivially when $\frac{r}{\tilde{r}}=1$ , then we only consider $\frac{r}{\tilde{r}}>1$ in the following two cases.

•

$\frac{r}{\tilde{r}}\leq\frac{s+1}{s}$ . In this case, $p^{*}=1$ . Since $\frac{r}{\tilde{r}}>1$ , then $\left(\frac{r}{\tilde{r}}\right)^{p}<\frac{r}{\tilde{r}}\leq\frac{s+1}{s}$ for any $p\in(0,\,1)$ .

•

$\frac{r}{\tilde{r}}>\frac{s+1}{s}$ . In this case, $p^{*}=\frac{\ln(s+1)-\ln s}{\ln r-\ln\tilde{r}}$ . Since $\frac{r}{\tilde{r}}>1$ , then $\left(\frac{r}{\tilde{r}}\right)^{p}<\left(\frac{r}{\tilde{r}}\right)^{p^{*}}=\frac{s+1}{s}$ for any $p\in(0,\,p^{*})$ .

Hence, (15) holds for any $p\in(0,\,p^{*})$ .

Next, let $\bm{x}^{*}$ be an arbitrary optimal solution of problem (4) with $p\in(0,\,p^{*})$ , i.e., $\bm{x}^{*}\in{\rm SOL}(A,\bm{b},\sigma,p,1)$ . It then follows from Lemma 2.4 that $\bm{x}^{*}$ is an extreme point of $\mathbb{P}_{j^{*}}\cap{\rm FEA}(A,\bm{b},\sigma,1)$ for some $j^{*}\in\{1,\cdots,2^{n}\}$ . Thus, we have $\frac{|x^{*}_{i}|}{\tilde{r}}\geq 1$ for any $|x^{*}_{i}|\neq 0$ . Moreover, we see that

[TABLE]

where the second inequality follows because for any $t\geq 1$ , the function $p\mapsto t^{p}$ is non-decreasing on $[0,1)$ , the equality (i) follows from (13), the third inequality follows because for any $0\leq t\leq 1$ , the function $p\mapsto t^{p}$ is non-increasing on $[0,1)$ , the equality (ii) follows again from (13), and the last inequality follows from (15). Then, from the above relation, we have that $\|\bm{x}^{*}\|_{0}=s$ and hence $\bm{x}^{*}$ is an optimal solution of problem (4) with $p=0$ . This completes the proof.

Before closing this section, we present a simple example to illustrate our previous theoretical results.

Example 2.13

Let $A=\begin{bmatrix}[r]1&1&1\\ 1&1&-1\end{bmatrix}$ , $\bm{b}=\begin{bmatrix}3\\ 3\end{bmatrix}$ and $\sigma=1$ . Then, we consider

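$$\min_{\bm{x}\in\mathbb{R}^{3}}\ \|\bm{x}\|_{p}^{p}\quad\mbox{s.t.}\quad\|A\bm{x}-\bm{b}\|_{q}\leq 1\tag{16}$$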

with $0\leq p\leq 1$ and $q=1,\,2,\,\infty$ . Next, for each $q$ , we discuss the optimal solution sets of problem (16) with different choices of $p$ .

For $q=1$ , the feasible set of (16) is

[TABLE]

Then,

[TABLE]

For $q=2$ , the feasible set of (16) is

[TABLE]

Then,

[TABLE]

For $q=\infty$ , the feasible set of (16) is

[TABLE]

Then,

[TABLE]
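The feasibility claims in this example are easy to check numerically. The sketch below evaluates $\|A\bm{x}-\bm{b}\|_{q}$ for $q\in\{1,2,\infty\}$ with the data of Example 2.13. The 1-sparse points $(t_{q},0,0)$ used here are worked out by hand from the constraint for illustration (solving $2^{1/q}|t-3|=1$, with the convention $2^{1/\infty}=1$); they are an assumption of this sketch and are not copied from the displayed solution sets.

```python
import math

# Data of Example 2.13.
A = [[1.0, 1.0, 1.0],
     [1.0, 1.0, -1.0]]
b = [3.0, 3.0]
sigma = 1.0


def residual_norm(x, q):
    """Return ||A x - b||_q for q = 1, 2, or math.inf."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    if q == math.inf:
        return max(abs(t) for t in r)
    return sum(abs(t) ** q for t in r) ** (1.0 / q)


# Hand-derived values of t for which (t, 0, 0) sits on the boundary
# ||A x - b||_q = sigma (an assumption of this illustration).
boundary_t = {1: 2.5, 2: 3.0 - 1.0 / math.sqrt(2.0), math.inf: 2.0}

for q, t in boundary_t.items():
    # Each printed residual norm should be (numerically) equal to sigma = 1.
    print(q, residual_norm([t, 0.0, 0.0], q))

# The zero vector is infeasible for every q, so no feasible point has
# cardinality 0, while the 1-sparse points above are feasible.
print([residual_norm([0.0, 0.0, 0.0], q) for q in (1, 2, math.inf)])
```

Since the zero vector violates the constraint and a feasible point with one nonzero entry exists, the smallest attainable cardinality is $1$ for each $q$, which matches the value $s=1$ used for $q=1$ in the discussion of this example.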

From this example, one can easily see that every optimal solution $\bm{x}^{*}$ of (16) is on the boundary of the feasible set for $0<p\leq 1$ and there is an $\alpha\in(0,1]$ such that $\alpha\bm{x}^{*}$ is on the boundary of the feasible set for $p=0$ , as claimed in Theorem 2.2(i). Moreover, every optimal solution of (16) with $0<p<1$ is exactly a sparsest solution over $\{\bm{x}\in\mathbb{R}^{3}:\|A\bm{x}-\bm{b}\|_{q}\leq 1\}$ for $q=1,\,2,\,\infty$ , while an optimal solution of (16) with $p=1$ may not be a sparsest one. This shows the potential advantage of using the $L_{p}$ -norm ( $0<p<1$ ) to approximate the $L_{0}$ -norm. In particular, when $q=1$ , one can further estimate $p^{*}$ by (14) for this example. Indeed, it is easy to see that $s=1$ . Then, from (11), we compute that $r=\frac{1+3\sqrt{2}}{\sqrt{2}}$ . Moreover, one can verify that

[TABLE]

Thus, it follows from (12) that $\tilde{r}=\frac{1}{2}$ . Now, using (14), since $\frac{r}{\tilde{r}}=6+\sqrt{2}>\frac{s+1}{s}=2$ , we have

$$p^{*}=\frac{\ln(s+1)-\ln s}{\ln r-\ln\tilde{r}}=\frac{\ln 2}{\ln(6+\sqrt{2})}.$$

Recalling Theorem 2.12, we know that every optimal solution of (16) with $p\in\left(0,\,\frac{\ln 2}{\ln(6+\sqrt{2})}\right)$ shall be an optimal solution of (16) with $p=0$ . This is clearly evident in (17). In fact, for this example, every optimal solution of (16) with $0<p<1$ is an optimal solution of (16) with $p=0$ . This shows that $p^{*}$ given in (14) may not be the optimal upper bound of $p$ such that ${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq{\rm SOL}(A,\bm{b},\sigma,0,1)$ for any $p\in(0,\,p^{*})$ . In addition, our current estimate $p^{*}$ in (14) depends on the knowledge on the optimal value $s$ , which may be unknown or difficult to find in practice. Fortunately, we observe that $p^{*}$ , viewed as a function of $s$ , is actually decreasing when $\frac{\ln(s+1)-\ln s}{\ln r-\ln\tilde{r}}\leq 1$ . Thus, one may estimate a proper upper bound $\tilde{s}$ for the true optimal value $s$ (i.e., $\tilde{s}\geq s$ ) and compute $\tilde{p}^{*}=\frac{\ln(\tilde{s}+1)-\ln\tilde{s}}{\ln r-\ln\tilde{r}}$ satisfying $\tilde{p}^{*}\leq p^{*}$ . It then follows from Theorem 2.12 that ${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq{\rm SOL}(A,\bm{b},\sigma,0,1)$ for any $p\in(0,\,\tilde{p}^{*})$ . But it should be noticed that such $\tilde{p}^{*}$ can be more conservative. Improving estimations of $p^{*}$ and $\tilde{p}^{*}$ will be an interesting research topic in the future.
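The estimate of $p^{*}$ can be reproduced with a few lines of Python. The function `p_star` below is a hypothetical helper that follows the two cases distinguished in the proof of Theorem 2.12 ( $p^{*}=1$ when $\frac{r}{\tilde{r}}\leq\frac{s+1}{s}$ , and the logarithmic ratio otherwise); it is a sketch under that reading of (14), with $r$ , $\tilde{r}$ and $s$ taken from Example 2.13.

```python
import math


def p_star(r, r_tilde, s):
    """Estimate p* from r, r_tilde and a cardinality value s >= 1.

    Follows the two cases in the proof of Theorem 2.12 (assumed reading of (14)).
    """
    if r / r_tilde <= (s + 1) / s:
        return 1.0
    return (math.log(s + 1) - math.log(s)) / (math.log(r) - math.log(r_tilde))


# Values from Example 2.13 with q = 1.
r = (1 + 3 * math.sqrt(2)) / math.sqrt(2)
r_tilde = 0.5
s = 1

p = p_star(r, r_tilde, s)
print(p)  # equals ln 2 / ln(6 + sqrt(2)), roughly 0.35

# Using an upper bound s_tilde >= s gives a smaller, more conservative value.
for s_tilde in (1, 2, 3, 10):
    print(s_tilde, p_star(r, r_tilde, s_tilde))
```

The loop illustrates the remark about $\tilde{p}^{*}$ : replacing $s$ by an overestimate $\tilde{s}$ shrinks the admissible range of $p$ but never enlarges it.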

3 A smoothing penalty method

In this section, we propose a smoothing penalty method for solving the $L_{1}$ -constrained problem (4) with $0<p<1$ . Before proceeding, we would like to point out that the smoothing penalty method presented in this paper can be extended without much difficulty to solve the $L_{\infty}$ -constrained problem, namely, problem (1) with $0<p<1$ and $q=\infty$ . This is because the $L_{\infty}$ -constrained problem is similar to the $L_{1}$ -constrained problem in the sense that both constraints $\|A\bm{x}-\bm{b}\|_{1}\leq\sigma$ and $\|A\bm{x}-\bm{b}\|_{\infty}\leq\sigma$ are polyhedral constraints, and the functions $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|_{1}$ and $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|_{\infty}$ are piecewise linear. On the other hand, for $1<q<\infty$ , the function $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|^{q}_{q}$ is continuously differentiable. Then, one can readily extend the smoothing penalty method proposed in [12] to solve problem (1) with $0<p<1$ and $1<q<\infty$ . However, the approach in [12] cannot be directly adapted for $q\in\{1,\infty\}$ due to the nonsmoothness of the function $\bm{x}\mapsto\|A\bm{x}-\bm{b}\|_{q}$ in these two cases. In view of the above, in this paper, we consider an alternative smoothing penalty method for solving the $L_{1}$ -constrained problem and omit the discussions on solving the $L_{\infty}$ -constrained problem to save space.

We first study the first-order optimality conditions for problem (4) with $0<p<1$ . For simplicity, from now on, let $\Phi(\bm{x}):=\|\bm{x}\|_{p}^{p}$ . Then, problem (4) with $0<p<1$ can be equivalently written as follows:

[TABLE]

It is known from the generalized Fermat’s rule [33, Theorem 10.1] that, at any local minimizer $\bar{\bm{x}}$ of (18) (hence (4)), the following first-order necessary condition holds:

[TABLE]

This motivates the following definition.

Definition 3.1 (Stationary point of problem (4) with $0<p<1$ )

A point $\bm{x}^{*}$ is said to be a stationary point of problem (4) with $0<p<1$ if $\bm{x}^{*}\in{\rm FEA}(A,\bm{b},\sigma,1)$ and (19) is satisfied with $\bm{x}^{*}$ in place of $\bar{\bm{x}}$ .

Note that finding an optimal solution of problem (4) with $0<p<1$ is NP-hard [11, 21]. Therefore, we shall focus on finding a stationary point of this problem. To this end, we introduce the following auxiliary penalty problem:

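$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \Phi(\bm{x})+\lambda\big(\|A\bm{x}-\bm{b}\|_{1}-\sigma\big)_{+},\tag{20}$$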

where $\lambda>0$ is the penalty parameter and $(\cdot)_{+}:=\max\{\cdot,\,0\}$ . This problem is indeed an exact penalty problem for problem (4) with $0<p<1$ . The detailed analysis for the exact penalization results regarding global and local minimizers is given in Appendix 8. However, problem (20) is still difficult to solve directly because both terms in (20) are nonsmooth, and moreover, $\Phi$ is nonconvex and non-Lipschitz. We then consider a partially smoothed version of (20) as follows:

$$\min_{\bm{x}\in\mathbb{R}^{n}}\ \Phi(\bm{x})+f_{\lambda,\mu,\nu}(\bm{x}),\tag{21}$$

where $\mu$ , $\nu>0$ are smoothing parameters and

[TABLE]

with

[TABLE]

Note that $g_{\mu}(s)$ and $h_{\nu}(t)$ are the smoothing functions of $(s)_{+}$ and $|t|$ , respectively (see Figure 1), and they have the following nice properties:

[TABLE]

More details on these smoothing functions can be found in [10, Section 3] and references therein. Thus, the composite function $f_{\lambda,\mu,\nu}(\bm{x})$ is indeed obtained by applying the smoothing technique twice. Hence, it is continuously differentiable and can be viewed as a smoothing function of $\lambda(\|A\bm{x}-\bm{b}\|_{1}-\sigma)_{+}$ . One can also show that

[TABLE]

Moreover, it is worth mentioning that when $\sigma=0$ , the auxiliary penalty problem (20) reduces to $\min\limits_{\bm{x}\in\mathbb{R}^{n}}\left\{\Phi(\bm{x})+\lambda\|A\bm{x}-\bm{b}\|_{1}\right\}$ . Then, the smoothing function $g_{\mu}$ of $(\cdot)_{+}$ is no longer needed and the subsequent analysis can also be simplified in this special case. Now, based on (21), we are ready to present a smoothing penalty method as Algorithm 1 for solving problem (4) with $0<p<1$ . We call it SPeL1 for short in the rest of this paper.
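To make the double smoothing concrete, the following Python sketch implements one common pair of piecewise-quadratic smoothing functions for $(s)_{+}$ and $|t|$ from the smoothing literature and composes them as $\lambda\,g_{\mu}\big(H_{\nu}(A\bm{x}-\bm{b})-\sigma\big)$ with $H_{\nu}(\bm{y})=\sum_{j}h_{\nu}(y_{j})$ . The specific formulas for `g` and `h` are assumptions of this sketch, since the displayed definitions are not reproduced in this text; they are chosen to be consistent with the properties used in the analysis (for instance, $h^{\prime}_{\nu}(t)=\pm 1$ once $\frac{2}{\nu}|t|\geq 1$ ).

```python
def g(s, mu):
    """Assumed smoothing of max(s, 0): quadratic on (-mu/2, mu/2)."""
    if s >= mu / 2:
        return s
    if s <= -mu / 2:
        return 0.0
    return s * s / (2 * mu) + s / 2 + mu / 8


def h(t, nu):
    """Assumed smoothing of |t|: quadratic on (-nu/2, nu/2)."""
    if abs(t) >= nu / 2:
        return abs(t)
    return t * t / nu + nu / 4


def residual(x, A, b):
    """Return the vector A x - b as a list."""
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]


def penalty_term(x, A, b, sigma, lam):
    """Nonsmooth penalty lam * (||A x - b||_1 - sigma)_+ ."""
    return lam * max(sum(abs(t) for t in residual(x, A, b)) - sigma, 0.0)


def smoothed_term(x, A, b, sigma, lam, mu, nu):
    """Smoothed penalty lam * g_mu(H_nu(A x - b) - sigma)."""
    H = sum(h(t, nu) for t in residual(x, A, b))
    return lam * g(H - sigma, mu)


# Data of Example 2.13, used here only as test input.
A = [[1.0, 1.0, 1.0], [1.0, 1.0, -1.0]]
b = [3.0, 3.0]
x = [2.5, 0.0, 0.0]
for mu in (1.0, 0.1, 0.01):
    # With nu = mu the gap to the nonsmooth penalty shrinks as mu -> 0.
    print(mu, smoothed_term(x, A, b, 1.0, 10.0, mu, mu) - penalty_term(x, A, b, 1.0, 10.0))
```

Under these assumed forms, $0\leq g_{\mu}(s)-(s)_{+}\leq\frac{\mu}{8}$ and $0\leq h_{\nu}(t)-|t|\leq\frac{\nu}{4}$ , so the smoothed term overestimates the penalty term by at most $\lambda\big(\frac{\mu}{8}+\frac{m\nu}{4}\big)$ .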

The reader may have observed that, since problem (20) is a penalty counterpart of problem (4) and problem (21) is a partially smoothed counterpart of problem (20), our method actually adopts the penalty strategy and the smoothing strategy at the same time for solving the nonconvex nonsmooth non-Lipschitz constrained problem (4) with $0<p<1$ . Specifically, in our method, at each iteration, we solve problem (21) approximately with given $(\lambda,\mu,\nu)$ , and then update $\bm{x}$ and $(\lambda,\mu,\nu)$ . The combination of these two strategies provides an efficient practical way to solve problem (4) with $0<p<1$ . This circumvents the potential disadvantages of the traditional penalty approach that directly solves the penalty problem (20) with an exact penalty parameter $\lambda^{*}$ , because (i) it is still not easy to solve problem (20) efficiently; (ii) it is, in general, hard to estimate the exact penalty parameter $\lambda^{*}$ and an overestimate may make the penalty problem (20) ill-conditioned. The convergence result that characterizes a cluster point of the sequence generated by SPeL1 in Algorithm 1 is shown in the next theorem. We should note that, though the proofs are motivated by those in [12, Theorem 4.2] and [29, Theorem 2], the technical details become much more involved since our smoothing function $f_{\lambda,\mu,\nu}$ is obtained by a composition of two smoothing functions $g_{\mu}$ and $H_{\nu}$ .

For the ease of future reference, we write down the gradients of $f_{\lambda,\mu,\nu}$ and $H_{\nu}$ as well as the derivatives of $g_{\mu}$ and $h_{\nu}$ as follows:

[TABLE]

Moreover, we claim that $\Phi$ is regular at any $\bm{x}\in\mathbb{R}^{n}$ as follows. Let $\phi(t)=|t|^{p}$ for any $t\in\mathbb{R}$ . It is easy to see that $\phi(t)$ is regular at any $t\neq 0$ , because $\phi(t)$ is smooth in a neighborhood of any $t\neq 0$ ; see [33, Exercise 8.8] and [33, Corollary 8.11]. For $t=0$ , it follows from [12, Lemma 2.5] and its proof that $\widehat{\partial}\phi(0)=\partial\phi(0)=\partial^{\infty}\phi(0)=\mathbb{R}$ . Moreover, from the definition of the horizon cone (see [33, Definition 3.3]), we have that $\widehat{\partial}\phi(0)^{\infty}=\mathbb{R}$ . Using these facts and [33, Corollary 8.11], we see that $\phi(t)$ is also regular at $t=0$ . Therefore, it follows from [33, Proposition 10.5] that $\Phi$ is regular at any $\bm{x}\in\mathbb{R}^{n}$ .

Theorem 3.2

Suppose that $\rho>1$ and $0<\theta<1$ are chosen such that $\theta\rho\leq 1$ . Let $\{\bm{x}^{k}\}^{\infty}_{k=0}$ be the sequence generated by the SPeL1 in Algorithm 1. Then, the following statements hold.

(i)

$\{\bm{x}^{k}\}$ is bounded.

(ii)

Any cluster point $\bm{x}^{*}$ of $\{\bm{x}^{k}\}$ is a feasible point of problem (4) with $0<p<1$ .

(iii)

Suppose that $\bm{x}^{*}$ is a cluster point of $\{\bm{x}^{k}\}$ and it holds at $\bm{x}^{*}$ that

[TABLE]

Then, $\bm{x}^{*}$ is a stationary point of problem (4) with $0<p<1$ .
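The role of the condition $\theta\rho\leq 1$ can be seen from a short sketch of the parameter schedule. The update rule below ( $\lambda_{k+1}=\rho\lambda_{k}$ , $\mu_{k+1}=\theta\mu_{k}$ , $\nu_{k+1}=\theta\nu_{k}$ ) is an assumption of this illustration, inferred from the facts used in the proof that $\lambda_{k}\to\infty$ , $\mu_{k},\nu_{k}\to 0$ and $\frac{\nu_{k}}{\mu_{k}}$ stays constant; the exact steps of Algorithm 1 are not reproduced here. The function name `schedule` is hypothetical, and exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def schedule(lam0, mu0, nu0, rho, theta, steps):
    """Return the list of (lambda_k, mu_k, nu_k) under the assumed geometric updates."""
    lam, mu, nu = Fraction(lam0), Fraction(mu0), Fraction(nu0)
    out = [(lam, mu, nu)]
    for _ in range(steps):
        lam, mu, nu = rho * lam, theta * mu, theta * nu
        out.append((lam, mu, nu))
    return out


# Illustrative parameter values with theta * rho = 1.
params = schedule(1, 1, Fraction(1, 2), rho=Fraction(2), theta=Fraction(1, 2), steps=5)
for lam, mu, nu in params:
    # nu / mu is constant, and lam * mu does not grow when theta * rho <= 1.
    print(lam, mu, nu, nu / mu, lam * mu)
```

With $\theta\rho\leq 1$ the product $\lambda_{k}\mu_{k}$ is non-increasing, so terms of order $\lambda_{k}\mu_{k}$ and $\lambda_{k}\nu_{k}$ stay bounded even though $\lambda_{k}\to\infty$ .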

Proof. Statement (i). First, we see that

[TABLE]

where the first inequality follows from the nonnegativity of $f_{\lambda_{k},\mu_{k},\nu_{k}}(\bm{x}^{k+1})$ (since $g_{\mu}(s)\geq 0$ for all $s$ ), the second inequality follows from (27), the third inequality follows from Step 1 in Algorithm 1, the fourth inequality follows from (24) and the last inequality follows from $\theta\rho\leq 1$ . This together with the level-boundedness of $\Phi$ (recall that $\Phi(\bm{x}):=\|\bm{x}\|_{p}^{p}$ ) implies that $\{\bm{x}^{k}\}$ is bounded.

Statement (ii). Since $\{\bm{x}^{k}\}$ is bounded, there exists at least one cluster point. Suppose that $\bm{x}^{*}$ is a cluster point of $\{\bm{x}^{k}\}$ and let $\{\bm{x}^{k_{i}}\}$ be a convergent subsequence such that $\lim\limits_{i\to\infty}\bm{x}^{k_{i}}=\bm{x}^{*}$ . Note that

[TABLE]

where the first inequality follows from (22), (23) and the fact that $g_{\mu}$ is non-decreasing, and the last inequality follows from (24). Then,

[TABLE]

Taking the limit in the above inequality along $\{\bm{x}^{k_{i}}\}$ and recalling that $\lambda_{k_{i}-1}\rightarrow\infty$ , $\mu_{k_{i}-1}\to 0$ , $\nu_{k_{i}-1}\to 0$ (see Step 3 in Algorithm 1), we see that $\|A\bm{x}^{*}-\bm{b}\|_{1}\leq\sigma$ . Hence, $\bm{x}^{*}$ is a feasible point of (4) with $0<p<1$ .

Statement (iii). We next show that $\bm{x}^{*}$ is a stationary point of problem (4) with $0<p<1$ . For simplicity, let $\bm{a}_{j}\in\mathbb{R}^{n}$ ( $j=1,\cdots,m$ ) be the column vector formed from the $j$ th row of $A$ , i.e., $A=[\bm{a}_{1},\cdots,\bm{a}_{m}]^{\top}\in\mathbb{R}^{m\times n}$ . Moreover, let $\bm{y}^{k+1}:=\bm{x}^{k,l_{k}+1}$ . Then, $\lim\limits_{i\to\infty}\bm{y}^{k_{i}}=\bm{x}^{*}$ thanks to $\bm{x}^{k_{i}}\to\bm{x}^{*}$ and (26) with $\epsilon_{k}\to 0$ . Thus, from (25) and (28), we see that for any $k\geq 1$ , there exists a $\bm{\xi}^{k}\in\partial\Phi(\bm{y}^{k})$ such that

[TABLE]

In the following, we consider two cases: $\|A\bm{x}^{*}-\bm{b}\|_{1}<\sigma$ and $\|A\bm{x}^{*}-\bm{b}\|_{1}=\sigma$ .

Case 1. In this case, we suppose that $\|A\bm{x}^{*}-\bm{b}\|_{1}<\sigma$ . Since $\|A\bm{x}^{k_{i}}-\bm{b}\|_{1}\to\|A\bm{x}^{*}-\bm{b}\|_{1}$ , then, for any $0<\gamma<\sigma-\|A\bm{x}^{*}-\bm{b}\|_{1}$ , there exists a sufficiently large $K_{\gamma}>0$ such that $\big{|}\|A\bm{x}^{k_{i}}-\bm{b}\|_{1}-\|A\bm{x}^{*}-\bm{b}\|_{1}\big{|}\leq\gamma$ for all $k_{i}\geq K_{\gamma}$ . Note that

[TABLE]

where the first inequality follows from (23), the equality follows from $\frac{\nu_{k}}{\mu_{k}}=\frac{\theta\nu_{k-1}}{\theta\mu_{k-1}}=\cdots=\frac{\nu_{0}}{\mu_{0}}$ , the second inequality holds for all $k_{i}\geq K_{\gamma}$ and the last inequality follows whenever $k_{i}\geq\widetilde{K}_{\gamma}$ for some $\widetilde{K}_{\gamma}\geq K_{\gamma}$ because $\mu_{k_{i}}\to 0$ and $\|A\bm{x}^{*}-\bm{b}\|_{1}-\sigma+\gamma<0$ . This together with (29) implies that $g^{\prime}_{\mu_{k_{i}-1}}\big{(}H_{\nu_{k_{i}-1}}(A\bm{x}^{k_{i}}-\bm{b})-\sigma\big{)}=0$ for all sufficiently large $k_{i}$ . Hence, (32) reduces to $\|\bm{\xi}^{k_{i}}\|\leq\epsilon_{k_{i}-1}$ for all sufficiently large $k_{i}$ . Then, we have from (5) that $0=\bm{\xi}^{*}\in\partial\Phi(\bm{x}^{*})$ . This together with $\mathcal{N}_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}^{*})=\{0\}$ (since $\|A\bm{x}^{*}-\bm{b}\|_{1}<\sigma$ ) implies that

[TABLE]

Moreover, since $\Phi$ and $\delta_{{\rm FEA}(A,\bm{b},\sigma,1)}$ are regular, then it follows from [33, Corollary 8.11] and [33, Exercise 8.14] that $\partial\Phi(\bm{x}^{*})=\widehat{\partial}\Phi(\bm{x}^{*})$ and $\mathcal{N}_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}^{*})=\partial\delta_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}^{*})=\widehat{\partial}\delta_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}^{*})$ . Using these facts and recalling [33, Theorem 8.6], [33, Corollary 10.9], we have

[TABLE]

which implies that $\bm{x}^{*}$ is a stationary point of problem (4) with $0<p<1$ .

Case 2. In this case, we suppose that $\|A\bm{x}^{*}-\bm{b}\|_{1}=\sigma$ . For such $\bm{x}^{*}$ , one can follow [25, Theorem 1.3.5 in Section D] to compute that

[TABLE]

For simplicity, let $\tilde{t}_{k}:=\lambda_{k-1}g^{\prime}_{\mu_{k-1}}(H_{\nu_{k-1}}(A\bm{x}^{k}-\bm{b})-\sigma)$ and $t^{j}_{k}:=\tilde{t}_{k}h^{\prime}_{\nu_{k-1}}([A\bm{x}^{k}-\bm{b}]_{j})$ for $j=1,\cdots,m$ . Also, let $\mathcal{J}^{0}:=\{j:[A\bm{x}^{*}-\bm{b}]_{j}=0\}$ , $\mathcal{J}^{+}:=\{j:[A\bm{x}^{*}-\bm{b}]_{j}>0\}$ and $\mathcal{J}^{-}:=\{j:[A\bm{x}^{*}-\bm{b}]_{j}<0\}$ . Then, (32) is equivalent to

[TABLE]

Since $\bm{x}^{k_{i}}\to\bm{x}^{*}$ and $\nu_{k_{i}-1}\to 0$ , there exists a sufficiently large $K>0$ such that for all $k_{i}\geq K$ , we have $[A\bm{x}^{k_{i}}-\bm{b}]_{j}>0$ and $\frac{2}{\nu_{k_{i}-1}}[A\bm{x}^{k_{i}}-\bm{b}]_{j}\geq 1$ for all $j\in\mathcal{J}^{+}$ , and have $[A\bm{x}^{k_{i}}-\bm{b}]_{j}<0$ and $\frac{2}{\nu_{k_{i}-1}}[A\bm{x}^{k_{i}}-\bm{b}]_{j}\leq-1$ for all $j\in\mathcal{J}^{-}$ . Thus, it follows from (30) that for all $k_{i}\geq K$ , we have $h^{\prime}_{\nu_{k_{i}-1}}([A\bm{x}^{k_{i}}-\bm{b}]_{j})=1$ for all $j\in\mathcal{J}^{+}$ and $h^{\prime}_{\nu_{k_{i}-1}}([A\bm{x}^{k_{i}}-\bm{b}]_{j})=-1$ for all $j\in\mathcal{J}^{-}$ . Moreover, for all $k_{i}\geq 1$ , we see from (29) and (30) that $g^{\prime}_{\mu_{k_{i}-1}}(H_{\nu_{k_{i}-1}}(A\bm{x}^{k_{i}}-\bm{b})-\sigma)\geq 0$ and $h^{\prime}_{\nu_{k_{i}-1}}([A\bm{x}^{k_{i}}-\bm{b}]_{j})\in[-1,1]$ for all $j$ . Then, for all $k_{i}\geq K$ , we have that $t^{j}_{k_{i}}=\tilde{t}_{k_{i}}\geq 0$ for all $j\in\mathcal{J}^{+}$ , $t^{j}_{k_{i}}=-\tilde{t}_{k_{i}}\leq 0$ for all $j\in\mathcal{J}^{-}$ and $t^{j}_{k_{i}}\in[-\tilde{t}_{k_{i}},\,\tilde{t}_{k_{i}}]$ for all $j\in\mathcal{J}^{0}$ .

We next prove by contradiction that $\{\bm{\xi}^{k_{i}}\}$ is bounded. Suppose that $\{\bm{\xi}^{k_{i}}\}$ is unbounded. Without loss of generality, we assume that $\|\bm{\xi}^{k_{i}}\|\to\infty$ and that $\frac{1}{\|\bm{\xi}^{k_{i}}\|}\bm{\xi}^{k_{i}}\to\bm{\xi}^{*}$ for some $\bm{\xi}^{*}$ . Then, it follows from (34) that

[TABLE]

Moreover, from the discussions in the last paragraph, for all $k_{i}\geq K$ , we have that $t_{k_{i}}^{j}/\|\bm{\xi}^{k_{i}}\|=\tilde{t}_{k_{i}}/\|\bm{\xi}^{k_{i}}\|\geq 0$ for all $j\in\mathcal{J}^{+}$ , $t_{k_{i}}^{j}/\|\bm{\xi}^{k_{i}}\|=-\tilde{t}_{k_{i}}/\|\bm{\xi}^{k_{i}}\|\leq 0$ for all $j\in\mathcal{J}^{-}$ and $t_{k_{i}}^{j}/\|\bm{\xi}^{k_{i}}\|\in\left[-\tilde{t}_{k_{i}}/\|\bm{\xi}^{k_{i}}\|,\,\tilde{t}_{k_{i}}/\|\bm{\xi}^{k_{i}}\|\right]$ for all $j\in\mathcal{J}^{0}$ . Then, it follows from (33) that

[TABLE]

for all $k_{i}\geq K$ . Then, passing to the limit in (35) along $\{\bm{x}^{k_{i}}\}$ , together with $\frac{\epsilon_{k_{i}-1}}{\|\bm{\xi}^{k_{i}}\|}\to 0$ and the closedness of $\mathcal{N}_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}^{*})$ , it is not hard to see that

[TABLE]

Since $\bm{\xi}^{*}\neq 0$ due to $\|\bm{\xi}^{*}\|=1$ , this contradicts (31). Hence, $\{\bm{\xi}^{k_{i}}\}$ is bounded. Without loss of generality, assume that $\bm{\xi}^{k_{i}}\to\bm{\xi}^{*}$ . Then, passing to the limit in (34) along $\{\bm{x}^{k_{i}}\}$ and $\{\bm{y}^{k_{i}}\}$ , making use of (33) and the closedness of $\mathcal{N}_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}^{*})$ , and recalling (5), we obtain that

[TABLE]

Thus, following arguments similar to those in Case 1, one can show that $\bm{x}^{*}$ is a stationary point of problem (4) with $0<p<1$ . This completes the proof.
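The sign pattern of the weights $t^{j}_{k}$ used in Case 2 can be checked numerically. The following Python sketch is only an illustration under stated assumptions: since (29) and (30) are not restated here, it assumes a Huber-type $h_{\nu}$ whose derivative is $2t/\nu$ for $|t|<\nu/2$ and $\pm 1$ otherwise (consistent with the property used above), a piecewise-quadratic smoothing $g_{\mu}$ of $(\cdot)_{+}$ whose derivative lies in $[0,1]$, and $H_{\nu}(\bm{r})=\sum_{j}h_{\nu}(r_{j})$. The function names are hypothetical.

```python
def h(t, nu):
    """Assumed Huber-type smoothing of |t|: quadratic on (-nu/2, nu/2)."""
    return abs(t) if abs(t) >= nu / 2 else t * t / nu + nu / 4


def dh(t, nu):
    """Derivative of h: sign(t) once |t| >= nu/2, and 2t/nu otherwise."""
    if abs(t) >= nu / 2:
        return 1.0 if t > 0 else -1.0
    return 2.0 * t / nu


def dg(t, mu):
    """Derivative of an assumed smoothing g_mu of max(t, 0); lies in [0, 1]."""
    return min(max(t / mu, 0.0), 1.0)


def penalty_weights(r, lam, mu, nu, sigma):
    """Return (t_tilde, [t^j]) for a residual vector r = A x - b."""
    H = sum(h(rj, nu) for rj in r)        # H_nu(r), assumed to be sum_j h_nu(r_j)
    t_tilde = lam * dg(H - sigma, mu)     # lambda * g'_mu(H_nu(r) - sigma) >= 0
    return t_tilde, [t_tilde * dh(rj, nu) for rj in r]
```

For a residual with one positive, one negative and one zero entry, the returned weights are $\tilde{t}$, $-\tilde{t}$ and a value in $[-\tilde{t},\tilde{t}]$, respectively, which is the pattern exploited in the boundedness argument.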

Remark 3.3 (Comments on condition (31))

Condition (31) used for Theorem 3.2(iii) is actually a classic constraint qualification for nonconvex nonsmooth optimization problems; see [33, Theorem 8.15]. Note that, for any $\bm{x}^{*}\in{\rm FEA}(A,\bm{b},\sigma,1)$ , we have

[TABLE]

Moreover, recall from [12, Lemma 2.5(ii)] that

[TABLE]

Thus, condition (31) obviously holds at a point $\bm{x}^{*}$ satisfying $\|A\bm{x}^{*}-\bm{b}\|_{1}<\sigma$ . For a point $\bm{x}^{*}$ satisfying $\|A\bm{x}^{*}-\bm{b}\|_{1}=\sigma$ , one sufficient condition for (31) is that, for some $i\in\mathrm{supp}(\bm{x}^{*})$ , $[A^{\top}\bm{d}]_{i}\neq 0$ holds for any $\bm{d}\in\partial\|\cdot\|_{1}(A\bm{x}^{*}-\bm{b})$ , i.e., $\mathrm{Diag}(\bm{x}^{*})A^{\top}\bm{d}\neq 0$ for any $\bm{d}\in\partial\|\cdot\|_{1}(A\bm{x}^{*}-\bm{b})$ .

To end this section, we briefly discuss how to approximately solve the smoothing penalty problem (21) so that conditions (25)–(27) hold. Note that, for any given $(\lambda,\mu,\nu)$ , $F_{\lambda,\mu,\nu}$ is a continuous function that consists of a nonconvex nonsmooth non-Lipschitz function $\Phi$ and a smooth function $f_{\lambda,\mu,\nu}$ . It is also not hard to verify that the gradient of $f_{\lambda,\mu,\nu}$ is Lipschitz continuous. Moreover, $F_{\lambda,\mu,\nu}$ is level-bounded because $\Phi$ is level-bounded and $f_{\lambda,\mu,\nu}$ is nonnegative (since $g_{\mu}$ is nonnegative). Hence, the well-known proximal gradient method and its variants can be applied to solve (21) with convergence guarantees; see, for example, [1, 2, 12, 39]. In our numerical experiments, we follow [12] and use the nonmonotone proximal gradient (NPG) method. The NPG method is essentially the proximal gradient method with a nonmonotone line search, which allows occasional increases in the objective value. With this technique, the NPG has been shown to perform better numerically than its monotone counterpart in many applications; see, for example, [22, 38, 39]. The iterative scheme of the NPG for solving (21) with $(\lambda_{k},\mu_{k},\nu_{k})$ is given as follows:

Choose $L_{k}^{\max}\geq L_{k}^{\min}>0$ , $\tau>1$ , $c>0$ , and an integer $N\geq 0$ . At the $l$ -th ( $l\geq 0$ ) iteration, choose $L_{k,l}^{0}\in[L_{k}^{\min},L_{k}^{\max}]$ and find the smallest nonnegative integer $i_{l}$ such that

$\left\{\begin{aligned} &\bm{w}\in\arg\min\limits_{\bm{x}\in\mathbb{R}^{n}}\Big{\{}\Phi(\bm{x})+\langle\nabla f_{\lambda_{k},\mu_{k},\nu_{k}}(\bm{x}^{k,l}),\,\bm{x}\rangle+\frac{\tau^{i_{l}}L_{k,l}^{0}}{2}\|\bm{x}-\bm{x}^{k,l}\|^{2}\Big{\}},\\ &F_{\lambda_{k},\mu_{k},\nu_{k}}(\bm{w})-\max\limits_{[l-N]_{+}\leq i\leq l}F_{\lambda_{k},\mu_{k},\nu_{k}}(\bm{x}^{k,i})\leq-\frac{c}{2}\|\bm{w}-\bm{x}^{k,l}\|^{2}.\end{aligned}\right.$

(36)

Then, set $\bm{x}^{k,l+1}=\bm{w}$ and $\bar{L}_{k,l}=\tau^{i_{l}}L_{k,l}^{0}$ .
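To make the scheme (36) concrete, the following Python sketch implements one possible version of it under several assumptions that are not taken from the text: $\Phi(\bm{x})=\|\bm{x}\|_{p}^{p}$ (so the subproblem separates into scalar proximal problems, solved here by bisection), a constant initial value `L0` instead of a rule for $L_{k,l}^{0}$, a cap on the number of backtracking steps, and a simple stopping rule on $\|\bm{x}^{k,l+1}-\bm{x}^{k,l}\|$. The names `prox_abs_p` and `npg` are hypothetical, and `f`, `grad_f` stand for any smooth function with Lipschitz gradient.

```python
import math


def prox_abs_p(v, lam, p):
    """Scalar prox: a minimizer of lam*|t|**p + 0.5*(t - v)**2, with 0 < p < 1."""
    a = abs(v)
    if a == 0.0:
        return 0.0
    # h(t) = lam*t**p + 0.5*(t - a)**2 is concave on (0, t0) and convex on (t0, inf).
    t0 = (lam * p * (1.0 - p)) ** (1.0 / (2.0 - p))
    dh = lambda t: lam * p * t ** (p - 1.0) + t - a
    if t0 >= a or dh(t0) >= 0.0:
        return 0.0                      # h is nondecreasing on [0, a], so 0 is optimal
    lo, hi = t0, a                      # dh(lo) < 0 < dh(hi) and dh is increasing here
    for _ in range(100):                # bisection for the positive stationary point
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if dh(mid) < 0.0 else (lo, mid)
    t = 0.5 * (lo + hi)
    h_t = lam * t ** p + 0.5 * (t - a) ** 2
    return math.copysign(t, v) if h_t < 0.5 * a * a else 0.0


def npg(f, grad_f, x, p, L0=1.0, tau=2.0, c=1e-4, N=2, max_iter=1000, tol=1e-10):
    """Nonmonotone proximal gradient sketch for F(x) = ||x||_p^p + f(x)."""
    F = lambda z: sum(abs(zi) ** p for zi in z) + f(z)
    hist = [F(x)]
    for _ in range(max_iter):
        g = grad_f(x)
        ref = max(hist[-(N + 1):])      # max of F over the last N+1 iterates
        L = L0
        for _ in range(60):             # safeguard (assumption): cap the backtracking
            # Minimizing Phi(x) + <g, x> + (L/2)||x - x_l||^2 is a prox step at x_l - g/L.
            w = [prox_abs_p(xi - gi / L, 1.0 / L, p) for xi, gi in zip(x, g)]
            d2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if F(w) - ref <= -0.5 * c * d2:
                break                   # nonmonotone sufficient decrease holds
            L *= tau
        x = w
        hist.append(F(x))
        if math.sqrt(d2) <= tol:
            break
    return x, hist
```

Comparing the new objective value with the maximum over the last $N+1$ iterates, rather than with the most recent value only, is what permits occasional increases of the objective.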

One can also show that, for any given $(\lambda_{k},\mu_{k},\nu_{k})$ and $\epsilon_{k}$ , a point $\bm{x}^{k,l_{k}}$ satisfying conditions (25)–(27) can be found by the NPG within a finite number of iterations. Indeed, it follows from [12, Proposition A.1(i)] that (27) holds for all $l\geq 0$ . Moreover, from the optimality condition of (36), we see that

[TABLE]

which implies that

[TABLE]

This together with the boundedness of $\{\bar{L}_{k,l}\}_{l\geq 0}$ (see [12, Proposition A.1(ii)]) and $\|\bm{x}^{k,l+1}-\bm{x}^{k,l}\|\to 0$ as $l\to\infty$ (see [12, Theorem A.1]) implies that (25) and (26) hold when $l$ is sufficiently large. In view of the above, the sequence $\{\bm{x}^{k}\}$ generated by the SPeL1 in Algorithm 1 is well-defined.

4 Numerical simulations

In this section, we conduct some numerical experiments for problem (4) with $0<p<1$ on finding sparse solutions to implicitly illustrate the theoretical results established in Section 2 and show the efficiency of our SPeL1 in Algorithm 1. All experiments are run in Matlab R2016a on a workstation with Intel(R) Xeon(R) Processor [email protected] and 64GB of RAM, equipped with 64-bit Windows 10 OS.

For the SPeL1, we set $\lambda_{0}=\mu_{0}=\nu_{0}=1$ and $\bm{x}^{0}=\bm{x}^{\mathrm{feas}}=A^{\dagger}\bm{b}$ , where the computation of $A^{\dagger}\bm{b}$ is not counted in the CPU time below. At the $k$ th outer iteration, we compute

[TABLE]

Then, based on these quantities, we set

[TABLE]

The initial tolerance for the subproblem is set to $\epsilon_{0}=10^{-3}$ and $\epsilon_{k+1}$ is updated as $\max\{\theta\epsilon_{k},10^{-8}\}$ (instead of $\theta\epsilon_{k}$ ) in our implementation. Finally, we terminate the SPeL1 when

[TABLE]

Once the SPeL1 is terminated and returns an approximate solution $\bm{x}^{*}$ , we also perform a refinement step by setting $x_{i}^{*}=0$ if $|x_{i}^{*}|/\|\bm{x}^{*}\|_{\infty}<10^{-8}$ to improve the quality of the approximate solution.
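The refinement step can be written in a few lines. The sketch below is a plain-Python illustration (the experiments themselves are run in Matlab); the function name `refine` is hypothetical, and the guard for the all-zero vector is an added assumption.

```python
def refine(x, rel_tol=1e-8):
    """Set x_i to zero whenever |x_i| / ||x||_inf < rel_tol."""
    x_inf = max(abs(v) for v in x)
    if x_inf == 0.0:                    # nothing to refine for the zero vector
        return list(x)
    return [0.0 if abs(v) / x_inf < rel_tol else v for v in x]
```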

For solving each subproblem (21) with $(\lambda_{k},\mu_{k},\nu_{k})$ in the SPeL1, we adapt the NPG described in (36) with $L_{k}^{\min}=10^{-6}$ , $L_{k}^{\max}=\big{(}\frac{m}{\mu_{k}}+\frac{2}{\nu_{k}}\big{)}\lambda_{k}\|A\|^{2}$ , $\tau=2$ , $c=10^{-4}$ and $N=2$ . Moreover, we set $L_{k,0}^{0}=1$ and, for any $l\geq 1$ ,

[TABLE]

with $\bm{x}^{k,-1}=\bm{x}^{k,0}$ , where

[TABLE]

The NPG method is terminated when the number of iterations exceeds 1000 or

[TABLE]

Note from (37) that if the first inequality above holds, condition (25) is then approximately satisfied.

In the following experiments, we consider randomly generated instances. Given a triple of dimensions $(m,n,s)$ , we randomly generate an instance as follows. First, we generate a matrix $A\in\mathbb{R}^{m\times n}$ with i.i.d. standard Gaussian entries and then normalize $A$ so that each column of $A$ has unit norm. We next choose a subset $\mathcal{S}\subset\{1,\cdots,n\}$ of size $s$ uniformly at random and generate an $s$ -sparse vector $\hat{\bm{x}}\in\mathbb{R}^{n}$ , which has i.i.d. standard Gaussian entries on $\mathcal{S}$ and zeros on $\mathcal{S}^{c}$ . Then, we generate the vector $\bm{b}\in\mathbb{R}^{m}$ by setting $\bm{b}=A\hat{\bm{x}}+\delta\bm{\xi}$ , where $\delta>0$ is a scaling parameter and $\bm{\xi}\in\mathbb{R}^{m}$ is the noise vector, each entry $\xi_{i}$ of which independently follows a certain distribution. We consider two cases:

• Case 1. We use the standard Gaussian distribution via the Matlab command: xi = randn(m,1).

• Case 2. We use the Student’s $t(2)$ distribution via the Matlab command: xi = trnd(2,m,1).

Finally, we set $\sigma=\delta\|\bm{\xi}\|_{1}$ so that $\hat{\bm{x}}\in{\rm FEA}(A,\bm{b},\sigma,1)$ . In particular, for such $\sigma$ , we have observed from our simulations that all random instances satisfy $\|\bm{b}\|_{1}>\sigma$ and hence $0\notin{\rm FEA}(A,\bm{b},\sigma,1)$ .
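The generation procedure above can be sketched as follows. This is a standard-library Python illustration rather than the Matlab code used in the experiments; the function name `generate_instance`, the `seed` argument and the list-of-rows matrix layout are assumptions. The Student's $t(2)$ samples are drawn as $Z/\sqrt{\chi^{2}_{2}/2}$ with $Z$ standard Gaussian, in place of Matlab's `trnd`.

```python
import math
import random


def generate_instance(m, n, s, delta, noise="gaussian", seed=0):
    """Return (A, x_hat, b, sigma) with b = A x_hat + delta*xi, sigma = delta*||xi||_1."""
    rng = random.Random(seed)
    # Gaussian matrix with columns normalized to unit Euclidean norm.
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    for j in range(n):
        norm = math.sqrt(sum(A[i][j] ** 2 for i in range(m)))
        for i in range(m):
            A[i][j] /= norm
    # s-sparse vector with Gaussian entries on a uniformly random support.
    support = rng.sample(range(n), s)
    x_hat = [0.0] * n
    for j in support:
        x_hat[j] = rng.gauss(0.0, 1.0)
    if noise == "gaussian":             # Case 1
        xi = [rng.gauss(0.0, 1.0) for _ in range(m)]
    else:                               # Case 2: Student's t with 2 degrees of freedom
        xi = [rng.gauss(0.0, 1.0)
              / math.sqrt((rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2) / 2.0)
              for _ in range(m)]
    b = [sum(A[i][j] * x_hat[j] for j in support) + delta * xi[i] for i in range(m)]
    sigma = delta * sum(abs(v) for v in xi)
    return A, x_hat, b, sigma
```

With this choice of $\sigma$, the residual $A\hat{\bm{x}}-\bm{b}=-\delta\bm{\xi}$ has $\ell_{1}$ norm equal to $\sigma$ (up to rounding), so $\hat{\bm{x}}$ lies on the boundary of the feasible set.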

Table 4 presents the numerical results of the SPeL1 for solving problem (4) with $0<p<1$ , where we use $\delta=10^{-3}$ and consider different choices of $(m,n,s)$ and $p$ under different noise cases. In this table, “nnz” denotes the number of nonzero entries in the refined terminating solution $\bm{x}^{*}$ ; “rank” denotes the rank of $A_{\mathcal{J}}$ with ${\mathcal{J}}=\mathrm{supp}(\bm{x}^{*})$ ; $\mathbf{err}_{1}:=\max\big{\{}\|\bm{x}^{*}\|_{\infty}-(\lambda_{\min}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}}))^{-\frac{1}{2}}(\sigma+\|\bm{b}\|_{2}),\,(|\mathcal{J}|\lambda_{\max}(A_{\mathcal{J}}^{\top}A_{\mathcal{J}}))^{-\frac{1}{2}}(\|\bm{b}\|_{1}-\sigma)-\|\bm{x}^{*}\|_{\infty},\,0\big{\}}$ ; and $\mathbf{err}_{2}:=\sigma-\|A\bm{x}^{*}-\bm{b}\|_{1}$ . All results presented are averages over 10 independent instances for each $(m,n,s)$ , and we display rounded values for “nnz” and “rank”. From Table 4, one can see that nnz $=$ rank, $\mathbf{err}_{1}=0$ and $\mathbf{err}_{2}\approx 0$ always hold, matching Theorem 2.2 established for an optimal solution of problem (4) with $0<p<1$ . This suggests that our SPeL1 is able to find a ‘good’ stationary point of problem (4) with $0<p<1$ , which shares important properties of an optimal solution.

We further generate one random instance for each $(m,n,s)$ under different noise cases, and then apply our SPeL1 to solve problem (4) with different $p$ . The numbers of nonzero entries in the approximate solutions obtained for different $p$ are presented in Figure 2. From this figure, we see that solving problem (4) with a smaller $p$ always gives a sparser approximate solution, and the sparsity is almost unchanged and is close to the sparsity of $\hat{\bm{x}}$ when $p$ is smaller than a certain threshold. This observation implicitly matches Theorem 2.9, which says that ${\rm SOL}(A,\bm{b},\sigma,p,1)\subseteq{\rm SOL}(A,\bm{b},\sigma,0,1)$ and ${\rm SOL}(A,\bm{b},\sigma,p,1)$ remains unchanged for any sufficiently small $p$ , and shows the potential advantage of solving problem (4) with a small $p$ for finding a sparse solution. Moreover, in practice, such $p$ need not be very small. From our experiments, we observe that $p=0.5$ is small enough for problem (4) to give a sparse solution.

Next, we consider using model (4) to recover a sparse solution of an underdetermined linear system from noisy measurements, and compare its performance with that of using the widely-studied $L_{2}$ -constrained problem (see, for example, [3, 12, 13, 35]):

[TABLE]

We will solve problem (39) with $0<p<1$ by the smoothing penalty method proposed in [12] and call it SPeL2 for short (the Matlab codes implemented by the authors of [12] are available at http://www.mypolyuweb.hk/~tkpong/Exact_lp_codes/). All parameters in the SPeL2 are chosen as the default settings, except that we terminate its subroutine NPG when the inner iteration number exceeds 1000 to reduce the cost of solving the subproblem while maintaining the quality of the eventual solution. Moreover, we initialize the SPeL2 at the same point as the SPeL1 and terminate the SPeL2 at the $k$ th iteration when $\max\left\{\eta_{1}^{k},\,\eta_{2}^{k},\,\eta_{4}^{k}\right\}<10^{-8}$ , where $\eta_{1}^{k}$ , $\eta_{2}^{k}$ are defined in (38) and $\eta_{4}^{k}:=\max\big{\{}\|A\bm{x}^{k+1}-\bm{b}\|_{2}-\sigma,\,0\big{\}}$ . We also apply the same refinement step to the approximate solution obtained by the SPeL2 to improve its quality.

In the comparisons below, we use $p=0.5$ and consider different $(m,n,s)$ and $\delta$ under different noise cases. For each $(m,n,s)$ and $\delta$ , we randomly generate $A$ , $\hat{\bm{x}}$ , $\bm{b}$ , $\bm{\xi}$ as described above, but set $\sigma=\delta\|\bm{\xi}\|_{1}$ for (4) and $\sigma=\delta\|\bm{\xi}\|$ for (39) so that the feasible sets of both (4) and (39) contain the sparse vector $\hat{\bm{x}}$ as a boundary point. The computational results are reported in Table 4, where “nnz” denotes the number of nonzero entries in the refined terminating solution $\bm{x}^{*}$ ; “feas” denotes the deviation of $\bm{x}^{*}$ from the constraint, which is given by $\eta_{3}^{k}$ for (4) and $\eta_{4}^{k}$ for (39); “recerr” denotes the relative recovery error $\|\bm{x}^{*}-\hat{\bm{x}}\|_{2}/\|\hat{\bm{x}}\|_{2}$ ; “time” denotes the computational time (in seconds). All results reported are averages over 10 independent instances for each $(m,n,s)$ and $\delta$ . One can observe from this table that, for the Gaussian noise case, the performance of our SPeL1 is comparable with that of the SPeL2 with respect to the relative recovery error, while for the Student’s $t(2)$ noise case, our SPeL1 gives sparse solutions with smaller relative recovery errors for all instances. It is worth noting that, for the problem of recovering sparse solutions, even marginal improvements in recovery error can be very hard to achieve. Moreover, all approximate solutions obtained by the SPeL1 are exactly feasible for (4) and the sparsity of each solution is closer to that of the true sparse vector in most cases.

To better visualize the recovery performance of SPeL1 and SPeL2, we generate more test instances and plot the “frequency of success” for each method with different $p$ . Specifically, we fix $m=128$ , $n=512$ and vary $s$ from 20 to 70. The noise level is set to $\delta=10^{-3}$ . For each $(m,n,s)$ , we generate 500 independent instances, and for each instance, we run each method to obtain an approximate solution $\bm{x}^{*}$ and consider the recovery successful if $\|\bm{x}^{*}-\hat{\bm{x}}\|_{2}/\|\hat{\bm{x}}\|_{2}<5\times 10^{-3}$ . The results of the experiments are presented in Figure 3. Note that when the number of measurements is fixed, a larger $s$ generally leads to a more difficult recovery problem and thus the success rate decreases, as shown in the figure. Moreover, one can see that for the Gaussian noise case, the success rate of our SPeL1 is comparable with that of the SPeL2, while for the Student’s $t(2)$ noise case, our SPeL1 can give better success rates, especially when $p$ is small. This highlights the potential advantage of our approach for recovering a sparse solution under non-Gaussian noise. One may also observe that when $s$ becomes larger and $p\leq 0.5$ , the success rates of both methods appear to become lower as $p$ becomes smaller. A possible reason is that when $s$ is large and $p$ is too small, finding a solution of problem (4) or (39) can be rather difficult and hence it is less likely for a stationary point to be a good candidate. Therefore, both SPeL1 and SPeL2 may still need some improvements for the hard cases ( $p$ is small and $s$ is large). We leave this interesting topic for future research.
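The two quantities behind the “frequency of success” plots are simple to compute. The Python sketch below illustrates them; the function names are hypothetical, and the default threshold is the value $5\times 10^{-3}$ stated above.

```python
import math


def recerr(x, x_hat):
    """Relative recovery error ||x - x_hat||_2 / ||x_hat||_2."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return num / math.sqrt(sum(b * b for b in x_hat))


def success_frequency(solutions, truths, tol=5e-3):
    """Fraction of instances whose relative recovery error is below tol."""
    hits = sum(recerr(x, xh) < tol for x, xh in zip(solutions, truths))
    return hits / len(solutions)
```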

5 Concluding remarks

In this paper, we consider a unified $L_{p}$ - $L_{q}$ sparse optimization problem (1) and study various properties of its optimal solutions. Specifically, without any condition on the sensing matrix $A$ , we provide upper bounds in cardinality and infinity norm for the optimal solutions, and show that all optimal solutions must be at the boundary of the feasible set when $0<p\leq 1$ ; see Theorem 2.2. Moreover, for $q\in\{1,\infty\}$ , we show that the $L_{q}$ -constrained problem with $0<p<1$ has finitely many optimal solutions; see Proposition 2.5 and Remark 2.6. We further show that, for $q\in\{1,\infty\}$ , there exists $0<p^{*}<1$ such that the solution set of the problem with any $0<p<p^{*}$ is contained in the solution set of the problem with $p=0$ and there also exists $0<\overline{p}<p^{*}$ such that the solution set of the problem with any $0<p\leq\overline{p}$ remains unchanged; see Theorem 2.9 and Remark 2.10. An estimation of such $p^{*}$ is also provided in Theorem 2.12. A convergent smoothing penalty method is also proposed to solve the $L_{1}$ -constrained problem with $0<p<1$ . Some numerical examples are presented to implicitly illustrate the theoretical results and show the efficiency of the proposed method for solving the constrained $L_{p}$ - $L_{1}$ problem under different noises.

Appendices

6 Proof of Lemma 2.7

First, for $k=1,\cdots,n$ , we define $p_{k}(\bm{a}):={\textstyle\sum_{j=1}^{n}}a_{j}^{k}$ , $p_{k}(\bm{b}):={\textstyle\sum_{j=1}^{n}}b_{j}^{k}$ ,

[TABLE]

Then, from Viète’s formula [24], we see that $a_{1},\cdots,a_{n}$ and $b_{1},\cdots,b_{n}$ are the roots of $q_{n}(t)$ and $r_{n}(t)$ , respectively, where

[TABLE]

Moreover, from [30, Eq. ( $2.11^{\prime}$ )] and the discussions that follow, we have that, for $k=1,\cdots,n$ ,

[TABLE]

with $\Lambda_{0}(\bm{a})=\Lambda_{0}(\bm{b})=1$ . Notice that $\Lambda_{1}(\bm{a})=\Lambda_{1}(\bm{b})=p_{1}(\bm{a})=p_{1}(\bm{b})$ and $p_{k}(\bm{a})=p_{k}(\bm{b})$ for $k=1,\cdots,n$ . Thus, from (40), it is not hard to show by induction that $\Lambda_{k}(\bm{a})=\Lambda_{k}(\bm{b})$ holds for $k=2,\cdots,n$ . This implies that $q_{n}(t)$ and $r_{n}(t)$ have the same roots and hence $\bm{a}=\bm{b}$ .
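The induction in this proof can be illustrated numerically. The Python sketch below uses the classical Newton identities $k\,e_{k}=\sum_{i=1}^{k}(-1)^{i-1}e_{k-i}\,p_{i}$ for the elementary symmetric polynomials $e_{k}$, which may differ from the $\Lambda_{k}$ above in sign convention. It shows that the power sums $p_{1},\cdots,p_{n}$ determine these coefficients, and hence the multiset of roots. The function names are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def power_sums(a):
    """Return [p_1, ..., p_n] with p_k = sum_j a_j**k."""
    n = len(a)
    return [sum(Fraction(x) ** k for x in a) for k in range(1, n + 1)]


def elementary_from_power_sums(p):
    """Recover [e_0, ..., e_n] from power sums via Newton's identities."""
    e = [Fraction(1)]                   # e_0 = 1
    for k in range(1, len(p) + 1):
        acc = sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1))
        e.append(acc / k)               # k * e_k = sum_i (-1)^(i-1) e_{k-i} p_i
    return e
```

For instance, $\bm{a}=(1,2,3)$ has power sums $(6,14,36)$, from which the recursion returns $(1,6,11,6)$, the coefficients (up to sign) of $(t-1)(t-2)(t-3)$; any reordering of $\bm{a}$ gives the same output.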

7 Proof of Lemma 2.8

First, from the Taylor expansion (with Lagrange remainder), for any $0<p<1$ , $c>0$ and $k\geq 0$ , we have

[TABLE]

where $\xi_{k+1}$ is a number between 0 and $p\ln c$ . Then, for any $0<p<1$ and $k\geq 0$ , we have

[TABLE]

where, for $j=1,\cdots,s$ , $\xi_{i_{j},k+1}$ is a number between 0 and $p\ln|a_{i_{j}}|$ , and $\eta_{t_{j},k+1}$ is a number between 0 and $p\ln|b_{t_{j}}|$ . In the following, we consider two cases.

Case 1: $\Delta_{k}(\bm{a},\bm{b})=0$ for all $k=1,\cdots,s$ , where $\Delta_{k}(\bm{a},\bm{b})$ is defined in (10). In this case, we have $\sum_{j=1}^{s}(\ln|a_{i_{j}}|)^{k}=\sum_{j=1}^{s}(\ln|b_{t_{j}}|)^{k}$ for all $k=1,\cdots,s$ . This together with Lemma 2.7 further implies that $(\ln|a_{i_{1}}|,\cdots,\ln|a_{i_{s}}|)=(\ln|b_{t_{1}}|,\cdots,\ln|b_{t_{s}}|)$ and hence $(|a_{i_{1}}|,\cdots,|a_{i_{s}}|)=(|b_{t_{1}}|,\cdots,|b_{t_{s}}|)$ . Then, we have $\|\bm{a}\|_{p}^{p}=\|\bm{b}\|_{p}^{p}$ for any $p>0$ . This proves statement (i).

Case 2: Case 1 does not hold. In this case, there must exist some $1\leq\tilde{k}\leq s$ so that $\Delta_{\tilde{k}}(\bm{a},\bm{b})\neq 0$ and $\Delta_{k}(\bm{a},\bm{b})=0$ for $k=1,\cdots,\tilde{k}-1$ . Then, we have from (41) and (10) that

[TABLE]

where $\Xi_{\tilde{k}+1}^{p}(\bm{a},\bm{b}):={\textstyle\sum_{j=1}^{s}}\big{(}e^{\xi_{i_{j},\tilde{k}+1}}(\ln|a_{i_{j}}|)^{\tilde{k}+1}-e^{\eta_{t_{j},\tilde{k}+1}}(\ln|b_{t_{j}}|)^{\tilde{k}+1}\big{)}$ . Note also that $\Delta_{\tilde{k}}(\bm{a},\bm{b})\neq 0$ and $\frac{p}{\tilde{k}+1}\Xi_{\tilde{k}+1}^{p}(\bm{a},\bm{b})\to 0$ as $p\to 0$ . Thus, there must exist a sufficiently small $p^{\prime}$ such that

[TABLE]

We now consider the following two cases.

• $\Delta_{\tilde{k}}(\bm{a},\bm{b})<0$ : in this case, using (42) and (43), we obtain that

[TABLE]

This implies that $\|\bm{a}\|_{p}^{p}<\|\bm{b}\|_{p}^{p}$ for any $p\in(0,\,p^{\prime}]$ .

• $\Delta_{\tilde{k}}(\bm{a},\bm{b})>0$ : in this case, using (42) and (43), we obtain that

[TABLE]

This implies that $\|\bm{a}\|_{p}^{p}>\|\bm{b}\|_{p}^{p}$ for any $p\in(0,\,p^{\prime}]$ .

Combining the above results, we complete the proof of statement (ii).
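A small numerical example may help to see statement (ii) at work. The Python sketch below assumes, consistently with Case 1, that $\Delta_{k}(\bm{a},\bm{b})$ is the difference between the sums of the $k$-th powers of $\ln|a_{i_{j}}|$ and of $\ln|b_{t_{j}}|$ over the nonzero entries; the function names are hypothetical.

```python
import math


def delta_k(a, b, k):
    """Assumed form of Delta_k: sum_j (ln|a_j|)^k - sum_j (ln|b_j|)^k over nonzeros."""
    sa = sum(math.log(abs(x)) ** k for x in a if x != 0)
    sb = sum(math.log(abs(x)) ** k for x in b if x != 0)
    return sa - sb


def lp_p(x, p):
    """||x||_p^p = sum of |x_i|^p over the nonzero entries."""
    return sum(abs(v) ** p for v in x if v != 0)
```

Take $\bm{a}=(2,\tfrac{1}{2})$ and $\bm{b}=(1,1)$. Then $\Delta_{1}(\bm{a},\bm{b})=0$ and $\Delta_{2}(\bm{a},\bm{b})=2(\ln 2)^{2}>0$, so $\tilde{k}=2$ and the lemma predicts $\|\bm{a}\|_{p}^{p}>\|\bm{b}\|_{p}^{p}$ for all sufficiently small $p$; indeed $2^{p}+2^{-p}>2$ for every $p>0$.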

8 Exact penalization

In this section, we show that problem (20) is actually an exact penalization for problem (4) with $0<p<1$ . For notational simplicity, we define a set $\mathcal{U}$ and a matrix $U$ as follows:

[TABLE]

where $\bm{u}_{i}\in\{-1,\,1\}^{m}$ and $\bm{u}_{i}\neq\bm{u}_{j}$ for any $i\neq j$ . Since each entry of $\bm{u}_{i}$ is either $1$ or $-1$ and $\bm{u}_{i}$ has dimension $m$ , there are exactly $2^{m}$ different choices of $\bm{u}_{i}$ and hence such $\mathcal{U}$ and $U$ are well-defined. Moreover, it is easy to see that if $\bm{u}_{i}\in\mathcal{U}$ , then $-\bm{u}_{i}\in\mathcal{U}$ . A simple example is given as follows: let $m=2$ , then

[TABLE]

We next present some auxiliary lemmas, which will be useful in our analysis.

Lemma 8.1

Let $A\in\mathbb{R}^{m\times n}$ , $\bm{b}\in\mathbb{R}^{m}$ and $\sigma>0$ . Then, ${\rm FEA}(A,\bm{b},\sigma,1)$ can be equivalently rewritten as $\{\bm{x}\in\mathbb{R}^{n}:UA\bm{x}\leq U\bm{b}+\sigma\mathbf{1}\}$ , where $U$ is defined in (44) and $\mathbf{1}:=(1,\cdots,1)^{\top}\in\mathbb{R}^{2^{m}}$ .

Proof. Observe that

[TABLE]

where the first equality follows from $\|A\bm{x}-\bm{b}\|_{1}=\max\limits_{\|\bm{u}\|_{\infty}\leq 1}\langle\bm{u},\,A\bm{x}-\bm{b}\rangle$ , the second equality follows because the maximizer of $\max\limits_{\|\bm{u}\|_{\infty}\leq 1}\langle\bm{u},\,A\bm{x}-\bm{b}\rangle$ must be an extreme point of $\{\bm{u}:\|\bm{u}\|_{\infty}\leq 1\}$ (see [32, Corollary 32.3.4]) and $\mathcal{U}$ is the set of all extreme points of $\{\bm{u}:\|\bm{u}\|_{\infty}\leq 1\}$ . This completes the proof.
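The characterization used in this proof is easy to verify by brute force for small $m$. The following Python sketch enumerates all $2^{m}$ sign vectors, so it is meant only as an illustration and not as a practical feasibility test; the function names and the list-of-rows matrix layout are assumptions.

```python
from itertools import product


def l1_via_sign_vectors(r):
    """Compute ||r||_1 as the maximum of <u, r> over all u in {-1, 1}^m."""
    return max(sum(uj * rj for uj, rj in zip(u, r))
               for u in product((-1.0, 1.0), repeat=len(r)))


def feasible_polyhedral(A, b, x, sigma):
    """Check u^T (A x - b) <= sigma for every sign vector u, i.e. U A x <= U b + sigma*1."""
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return all(sum(uj * rj for uj, rj in zip(u, r)) <= sigma
               for u in product((-1.0, 1.0), repeat=len(r)))
```

The second function returns the same answer as the direct test $\|A\bm{x}-\bm{b}\|_{1}\leq\sigma$, which is the content of Lemma 8.1.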

From Lemma 8.1, it is easy to see that the feasible set ${\rm FEA}(A,\bm{b},\sigma,1)$ is a convex polyhedron. This together with the Hoffman error bound theorem [26] gives the following lemma.

Lemma 8.2

There exists a constant $\tilde{c}>0$ such that

[TABLE]

holds for any $\bm{x}\in\mathbb{R}^{n}$ , where $\tilde{A}=UA$ , $\tilde{\bm{b}}=U\bm{b}+\sigma\mathbf{1}$ and $U$ is defined in (44).

Based on this error bound result, we further give the following lemma.

Lemma 8.3

There exists a constant $c>0$ such that, for any $\bm{x}\in\mathbb{R}^{n}$ , we have

[TABLE]

Proof. We first show that, for any $\bm{x}\in\mathbb{R}^{n}$ , it holds that

[TABLE]

where $\tilde{A}$ and $\tilde{\bm{b}}$ are defined in Lemma 8.2. Indeed, for any $\bm{x}\in\mathbb{R}^{n}$ , there exists some $\tilde{\bm{u}}\in\{-1,\,1\}^{m}$ such that $\|A\bm{x}-\bm{b}\|_{1}=\tilde{\bm{u}}^{\top}(A\bm{x}-\bm{b})$ . Then, we have

[TABLE]

On the other hand, from $\|A\bm{x}-\bm{b}\|_{1}=\max\limits_{\|\bm{u}\|_{\infty}\leq 1}\langle\bm{u},\,A\bm{x}-\bm{b}\rangle$ , we have

[TABLE]

Then, we see that

[TABLE]

where the second equality follows because if $\bm{u}_{i}\in\mathcal{U}$ and $\bm{u}_{i}^{\top}(A\bm{x}-\bm{b})-\sigma>0$ , then $-\bm{u}_{i}\in\mathcal{U}$ and $-\bm{u}_{i}^{\top}(A\bm{x}-\bm{b})-\sigma<-\left(\bm{u}_{i}^{\top}(A\bm{x}-\bm{b})-\sigma\right)<0$ , and the inequality follows from (46). From the above, we obtain (45). This together with Lemma 8.2 completes the proof.

Now, we are ready to present our exact penalization results. Our first theorem concerns local minimizers of problems (4) and (20). The other two theorems concern $\epsilon$ -minimizers of problems (4) and (20) (see definitions later).

Theorem 8.4

Suppose that $\bm{x}^{*}$ is a local minimizer of (4). Then, there exists a $\lambda^{*}>0$ such that $\bm{x}^{*}$ is a local minimizer of (20) whenever $\lambda\geq\lambda^{*}$ .

Proof. We first assume that $\bm{x}^{*}=0$ and consider any bounded neighborhood $\mathcal{N}$ of 0 and $\lambda>0$ . Let $L$ denote a Lipschitz constant of the function $\bm{x}\mapsto\lambda(\|A\bm{x}-\bm{b}\|_{1}-\sigma)_{+}$ on $\mathcal{N}$ . For this $L$ , one can verify that there exists a neighborhood $\widetilde{\mathcal{N}}\subseteq\mathcal{N}$ of 0 such that $\|\bm{x}\|_{p}^{p}\geq L\|\bm{x}\|$ for all $\bm{x}\in\widetilde{\mathcal{N}}$ . Then, for any $\bm{x}\in\widetilde{\mathcal{N}}$ , we have

[TABLE]

where the last inequality follows from the definition of $L$ being a Lipschitz constant. This shows that $\bm{x}^{*}=0$ is a local minimizer of (20) for any $\lambda>0$ .

From now on, we assume that $\bm{x}^{*}\neq 0$ . Let $\mathcal{J}:=\mathrm{supp}(\bm{x}^{*})$ for simplicity. Then, $\mathcal{J}\neq\emptyset$ since $\bm{x}^{*}\neq 0$ . Since $\bm{x}^{*}$ is a local minimizer of (4), one can verify that $\bm{x}^{*}_{\mathcal{J}}$ is a local minimizer of the following problem:

[TABLE]

Let $\tilde{\epsilon}=\frac{1}{2}\min\big{\{}|x^{*}_{i}|:i\in\mathcal{J}\big{\}}>0$ . Thus, there exists a small $\delta>0$ such that $\bm{x}^{*}_{\mathcal{J}}$ is a local minimizer of (47) and $\min\big{\{}|x_{i}|:i\in\mathcal{J}\big{\}}>\tilde{\epsilon}$ for all $\bm{x}_{\mathcal{J}}\in\mathcal{B}(\bm{x}^{*}_{\mathcal{J}};\delta)$ . Moreover, note that $\bm{x}_{\mathcal{J}}\mapsto\|\bm{x}_{\mathcal{J}}\|_{p}^{p}$ is Lipschitz continuous on $\mathcal{B}(\bm{x}^{*}_{\mathcal{J}};\delta)$ and there exists a constant $c^{\prime}>0$ such that $\mathrm{dist}(\bm{x}_{\mathcal{J}},\,\Omega_{\mathcal{J}})\leq c^{\prime}\,(\|A_{\mathcal{J}}\bm{x}_{\mathcal{J}}-\bm{b}\|_{1}-\sigma)_{+}$ for all $\bm{x}_{\mathcal{J}}\in\mathcal{B}(\bm{x}^{*}_{\mathcal{J}};\delta)$ (see Lemma 8.3). Then, from [12, Lemma 3.1] (or [28, Proposition 4]), there exists a $\lambda^{*}>0$ such that, for any $\lambda\geq\lambda^{*}$ , $\bm{x}^{*}_{\mathcal{J}}$ is a local minimizer of the following problem:

[TABLE]

i.e., there exists a neighborhood $\mathcal{N}_{\mathcal{J}}$ of 0 with $\mathcal{N}_{\mathcal{J}}\subseteq\mathcal{B}(0;\frac{\delta}{2})$ such that

[TABLE]

We now show that $\bm{x}^{*}$ is a local minimizer of (20) for any $\lambda\geq\lambda^{*}$ . Fix any $\epsilon>0$ and any $\lambda\geq\lambda^{*}$ . Consider the bounded neighborhood $\mathcal{V}:=\mathcal{N}_{\mathcal{J}}\times(-\epsilon,\,\epsilon)^{n-|\mathcal{J}|}$ of 0 and let $\widetilde{L}$ be a Lipschitz constant of the function $g_{\lambda}(\bm{x}):=\lambda(\|A\bm{x}-\bm{b}\|_{1}-\sigma)_{+}$ on $\bm{x}^{*}+\mathcal{V}$ . For this $\widetilde{L}$ , there exists an $\tilde{\epsilon}\in(0,\epsilon)$ such that $\|\bm{v}_{\mathcal{J}^{c}}\|_{p}^{p}\geq\widetilde{L}\|\bm{v}_{\mathcal{J}^{c}}\|$ for all $\bm{v}_{\mathcal{J}^{c}}\in(-\tilde{\epsilon},\,\tilde{\epsilon})^{n-|\mathcal{J}|}$ . Then, for any $\bm{v}\in\widetilde{\mathcal{V}}:=\mathcal{N}_{\mathcal{J}}\times(-\tilde{\epsilon},\,\tilde{\epsilon})^{n-|\mathcal{J}|}$ , we have

[TABLE]

where the first inequality follows from the Lipschitz continuity of $g_{\lambda}$ with Lipschitz constant $\widetilde{L}$ and the last inequality follows from (48). This shows that $\bm{x}^{*}$ is a local minimizer of (20) for any $\lambda\geq\lambda^{*}$ and completes the proof.

We next study $\epsilon$ -minimizers of (4) and (20), which are defined as follows.

Definition 8.5 ( $\epsilon$ -minimizer)

Let $\epsilon>0$ .

(i) $\bm{x}_{\epsilon}$ is said to be an $\epsilon$ -minimizer of problem (4) if $\bm{x}_{\epsilon}\in{\rm FEA}(A,\bm{b},\sigma,1)$ and $\|\bm{x}_{\epsilon}\|_{p}^{p}\leq\min\big{\{}\|\bm{x}\|_{p}^{p}:\bm{x}\in{\rm FEA}(A,\bm{b},\sigma,1)\big{\}}+\epsilon$ .

(ii) $\bm{x}_{\epsilon}$ is said to be an $\epsilon$ -minimizer of problem (20) if $F_{\lambda}(\bm{x}_{\epsilon})\leq\min\limits_{\bm{x}\in\mathbb{R}^{n}}F_{\lambda}(\bm{x})+\epsilon$ .

We also introduce the following function:

[TABLE]

where $\mu>0$ is a constant. Note that $\Psi_{\mu}$ is continuously differentiable. Moreover, from the discussions in [12, Section 3.3], we have that

[TABLE]

Then, we characterize the relation between the global minimizer of problem (4) and the $\epsilon$ -minimizer of problem (20) in the next theorem.

Theorem 8.6

Suppose that $\bm{x}^{*}$ is a global minimizer of problem (4). Then, for any $\epsilon>0$ , there exists a $\lambda_{\epsilon}^{*}>0$ such that $\bm{x}^{*}$ is an $\epsilon$ -minimizer of problem (20) whenever $\lambda\geq\lambda_{\epsilon}^{*}$ .

Proof. First, for any $\epsilon>0$ , we consider $\mu=2\left(\epsilon/n\right)^{\frac{1}{p}}$ and $\Psi_{\mu}$ defined in (49). Then, we see from (50) and (51) that

[TABLE]

and $\Psi_{\mu}$ is globally Lipschitz continuous with Lipschitz constant $L_{\mu}:=\sqrt{n}p\mu^{p-1}$ . Now, let $\lambda_{\epsilon}^{*}:=cL_{\mu}$ , where $c>0$ is chosen as in Lemma 8.3. For any $\bm{x}\in\mathbb{R}^{n}$ , we also use $\mathcal{P}_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x})$ to denote the projection of $\bm{x}$ on ${\rm FEA}(A,\bm{b},\sigma,1)$ . Then, for $\lambda\geq\lambda_{\epsilon}^{*}$ and any $\bm{x}\in\mathbb{R}^{n}$ ,

[TABLE]

where the first inequality follows from (52), the second inequality follows from Lemma 8.3, the third inequality follows from $\lambda\geq\lambda_{\epsilon}^{*}=cL_{\mu}$ , the fourth inequality follows from the Lipschitz continuity of $\Psi_{\mu}$ with Lipschitz constant $L_{\mu}$ , and the last two inequalities follow from (52) and the definition of $\bm{x}^{*}$ as a minimizer of problem (4). This shows that $\bm{x}^{*}$ is an $\epsilon$ -minimizer of problem (20) and completes the proof.

From Theorems 8.4 and 8.6, we see that if $\bm{x}^{*}$ is a local minimizer or global minimizer of problem (4), then it is also a local minimizer or $\epsilon$ -minimizer of problem (20). Conversely, it is easy to see that if $\bm{x}^{*}$ is a local minimizer or $\epsilon$ -minimizer of problem (20) for some $\lambda>0$ and $\bm{x}^{*}\in{\rm FEA}(A,\bm{b},\sigma,1)$ , then it is also a local minimizer or $\epsilon$ -minimizer of problem (4). Finally, we shall study the case when $\bm{x}^{*}$ is a global minimizer of problem (20) for some $\lambda>0$ but $\bm{x}^{*}\notin{\rm FEA}(A,\bm{b},\sigma,1)$ .

Theorem 8.7

Suppose that $\tilde{\bm{x}}$ is an arbitrary feasible point of problem (4), i.e., $\tilde{\bm{x}}\in{\rm FEA}(A,\bm{b},\sigma,1)$ . Take any $\epsilon>0$ and consider any $\lambda\geq c\left(n^{\frac{p}{2}-1}\epsilon\right)^{-\frac{1}{p}}\|\tilde{\bm{x}}\|_{p}^{p}$ , where $c>0$ is chosen as in Lemma 8.3. Then, for any global minimizer $\bm{x}_{\lambda}^{*}$ of problem (20), the projection $\mathcal{P}_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}_{\lambda}^{*})$ is an $\epsilon$ -minimizer of problem (4).

Proof. First, from the definition of $F_{\lambda}$ and the global optimality of $\bm{x}_{\lambda}^{*}$ , we have

[TABLE]

Then, for any $\bm{x}\in{\rm FEA}(A,\bm{b},\sigma,1)$ , we have

[TABLE]

where the first inequality follows from (53), the second inequality follows from [12, Lemma 2.4], the third inequality follows from the concavity of the function $t\mapsto t^{\frac{p}{2}}$ for nonnegative $t$ , the fourth inequality follows from Lemma 8.3 and the last two inequalities follow from (54) and the choice of $\lambda$ . This implies that $\mathcal{P}_{{\rm FEA}(A,\bm{b},\sigma,1)}(\bm{x}_{\lambda}^{*})$ is an $\epsilon$ -minimizer of (4) and completes the proof.

Acknowledgments

The authors are grateful to the editor and the anonymous referees for their valuable suggestions and comments, which have helped to improve the quality of this paper. The authors would also like to thank the CAS AMSS-PolyU Joint Laboratory of Applied Mathematics for its support while this research was being conducted. The research of Shuhuang Xiang was supported in part by the National Natural Science Foundation of China (Grant No. 11771454).

Bibliography

[1] Attouch H, Bolte J, Svaiter B (2013) Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1):91–129.
[2] Beck A (2017) First-Order Methods in Optimization, volume 25 (SIAM).
[3] Bruckstein A, Donoho D, Elad M (2009) From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51(1):34–81.
[4] Cai T, Liu W, Luo X (2011) A constrained $\ell_{1}$ minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106(494):594–607.
[5] Candès E, Romberg J, Tao T (2006) Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8):1207–1223.
[6] Candès E, Tao T (2005) Decoding by linear programming. IEEE Trans. Inf. Theory 51(12):4203–4215.
[7] Candès E, Tao T (2007) The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Stat. 35(6):2313–2351.
[8] Chartrand R (2007) Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett. 14(10):707–710.
