An Efficient and Globally Convergent Algorithm for $\ell_{p,q}$-$\ell_{r}$ Model in Group Sparse Optimization

Yunhua Xue; Yanfei Feng; Chunlin Wu

arXiv:1904.01887·math.NA·April 4, 2019

TL;DR

This paper introduces a new proximally linearized algorithm called InISSAPL, designed to efficiently solve non-Lipschitz group sparse optimization problems involving the $\ell_{p,q}$-$\ell_{r}$ model with guaranteed global convergence.

Contribution

The paper presents the first efficient algorithm with global convergence guarantees for the $\ell_{p,q}$-$\ell_{r}$ group sparse optimization problem.

Findings

01. The algorithm converges globally for the non-Lipschitz $\ell_{p,q}$-$\ell_{r}$ model.

02. It outperforms existing methods in efficiency and accuracy.

03. The method is applicable to various group sparse optimization tasks.

Abstract

Group sparsity combines the underlying sparsity and group structure of the data in problems. We develop a proximally linearized algorithm InISSAPL for the non-Lipschitz group sparse $ℓ_{p, q}$ - $ℓ_{r}$ optimization problem.

Tables (3)

Table 1: Relative errors of the reconstruction by InISSAPL with two kinds of starting points.

$s$	Error	$\mathbf{A}_{1}$	$\mathbf{A}_{2}$	$\mathbf{A}_{3}$
$s = 8$	$\epsilon$	$0.0042$	$0.0036$	$0.0041$
$s = 8$	$\bar{\epsilon}$	$0.0042$	$0.0036$	$0.0041$
$s = 16$	$\epsilon$	$0.0059$	$0.0063$	$0.0058$
$s = 16$	$\bar{\epsilon}$	$0.0059$	$0.0063$	$0.0058$
$s = 24$	$\epsilon$	$0.4107$	$0.0093$	$0.0084$
$s = 24$	$\bar{\epsilon}$	$0.4013$	$0.1016$	$0.0095$

Table 2: Relative error $\epsilon$ over $r$ for Laplace noise (top), Gaussian noise (middle) and uniform noise (bottom) with $p=2$, $q=0.5$, $\sigma=0.01$.

Laplace noise	$ϵ (r = 1)$	$ϵ (r = 2)$	$ϵ (r = \infty)$
$s = 4$	0.0370	$0.0854$	$0.0858$
$s = 8$	0.0270	$0.0564$	$0.0588$
$s = 12$	0.0362	$0.0569$	$0.0586$
$s = 16$	0.0491	$0.0635$	$0.0654$
Gaussian noise	$ϵ (r = 1)$	$ϵ (r = 2)$	$ϵ (r = \infty)$
$s = 4$	$0.0507$	0.0186	$0.0504$
$s = 8$	$0.0440$	0.0203	$0.0405$
$s = 12$	$0.0420$	0.0247	$0.0408$
$s = 16$	$0.0604$	0.0297	$0.0638$
uniform noise	$ϵ (r = 1)$	$ϵ (r = 2)$	$ϵ (r = \infty)$
$s = 4$	$0.0265$	$0.0234$	0.0110
$s = 8$	$0.0254$	$0.0216$	0.0127
$s = 12$	$0.0220$	$0.0192$	0.0159
$s = 16$	$0.0178$	$0.0170$	0.0147

Table 3: Comparison of running time and relative error $\epsilon$ for the PGM-GSO, e-PGM-GSO and InISSAPL algorithms on two problems of different size. The advantage of our algorithm grows as the problem scale increases.

$M = 256$, $N = 1024$	PGM-GSO		e-PGM-GSO		InISSAPL	
$s$	Time(s)	$ϵ$	Time(s)	$ϵ$	Time(s)	$ϵ$
4	$0.56$	$0.0024$	$0.59$	$0.0031$	0.46	0.0023
8	$0.58$	0.0025	$0.59$	$0.0033$	0.49	$0.0027$
12	$0.58$	0.0030	$0.60$	$0.0032$	0.50	0.0030
16	$0.59$	$0.0033$	$0.81$	$0.0040$	0.52	0.0031
$M = 1024$, $N = 4096$	PGM-GSO		e-PGM-GSO		InISSAPL	
$s$	Time(s)	$ϵ$	Time(s)	$ϵ$	Time(s)	$ϵ$
25	$18.04$	$0.0026$	$18.99$	$0.0039$	3.98	0.0025
50	$18.07$	0.0027	$18.32$	$0.0037$	4.20	0.0027
75	$18.18$	$0.0029$	$18.21$	$0.0046$	6.58	0.0028
100	$18.25$	$0.6095$	$18.87$	$0.8928$	9.02	0.0866

Equations (240)

$\displaystyle\min_{\mathbf{x}\in\mathbb{R}^{N}}\mathcal{E}(\mathbf{x}):=\left\|\mathbf{x}\right\|_{p,q}^{q}+F_{r}(\mathbf{x}),$

$F_{r}(\mathbf{x})=\begin{cases}\displaystyle\frac{1}{r\alpha}\left\|\mathbf{A}\mathbf{x}-\mathbf{y}\right\|_{r}^{r},&r\geq 1,\\ \displaystyle\frac{1}{\alpha}\left\|\mathbf{A}\mathbf{x}-\mathbf{y}\right\|_{\infty},&r=\infty,\end{cases}$

$\displaystyle\left\|\mathbf{x}\right\|_{p,q}=\Big(\sum_{\mathsf{i}=\sf 1}^{\mathsf{g}}\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p}^{q}\Big)^{1/q},$

$\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{n},$

$\operatorname{supp}_{\mathsf{G}}(\mathbf{x}):=\{\mathsf{i}\in\mathsf{G}:\mathbf{x}_{\mathsf{i}}\neq\mathbf{0}\},$

$\operatorname{supp}(\mathbf{x}_{\mathsf{i}})=\{j\in J_{\mathsf{i}}:\mathbf{x}_{\mathsf{i},j}\neq 0\}.$

$\mathbf{A}=\left[\begin{array}{cccc}A_{1,\sf 1}&A_{1,\sf 2}&\cdots&A_{1,\mathsf{g}}\\ \cdots&\cdots&\cdots&\cdots\\ A_{M,\sf 1}&A_{M,\sf 2}&\cdots&A_{M,\mathsf{g}}\\ \end{array}\right].$

$\phi(y)\leq\phi(x)+\phi^{\prime}(x)(y-x),\quad\forall x\in(0,\infty),\ y\in[0,\infty).$

$|\phi^{\prime}(x)-\phi^{\prime}(y)|\leq L_{c}|x-y|.$

$\left\|\mathbf{y}\right\|_{\gamma_{2}}\leq\left\|\mathbf{y}\right\|_{\gamma_{1}},\quad 0<\gamma_{1}\leq\gamma_{2}.$

$\left\|\mathbf{y}\right\|_{s}\leq C_{s}\left\|\mathbf{y}\right\|_{s+1}.$

$\left\|\mathbf{y}\right\|_{s}\leq m^{1-2^{-Z}}\left\|\mathbf{y}\right\|_{2},$

$\left\|\mathbf{y}\right\|_{s}\leq C_{s}\left\|\mathbf{y}\right\|_{s+1},$

$\displaystyle\partial(\phi\circ g)(\mathbf{y})=\prod_{j=1}^{m}S_{j},$

$S_{j}=\begin{cases}\phi^{\prime}(\left\|\mathbf{y}\right\|_{p})\left\|\mathbf{y}\right\|_{p}^{1-p}|y_{j}|^{p-1}\operatorname{sgn}(y_{j}),&p>1,\\ \phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\operatorname{sgn}(y_{j}),&j\in\operatorname{supp}(\mathbf{y})\mbox{ and }p=1,\\ \left[-\phi^{\prime}(\left\|\mathbf{y}\right\|_{1}),\phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\right],&j\notin\operatorname{supp}(\mathbf{y})\mbox{ and }p=1.\end{cases}$

$\displaystyle\liminf_{\mathbf{z}\to\mathbf{0},\,\mathbf{z}\neq\mathbf{0}}\frac{\left\|\mathbf{z}\right\|_{p}^{q}-\langle\mathbf{u},\mathbf{z}-\mathbf{0}\rangle}{\left\|\mathbf{z}-\mathbf{0}\right\|_{2}}\geq 0.$

$\left\|\mathbf{z}\right\|_{p}\geq C\left\|\mathbf{z}\right\|_{2},$

$\displaystyle\frac{\left\|\mathbf{z}\right\|_{p}^{q}-\langle\mathbf{u},\mathbf{z}-\mathbf{0}\rangle}{\left\|\mathbf{z}-\mathbf{0}\right\|_{2}}\geq\frac{C^{q}\left\|\mathbf{z}\right\|_{2}^{q}-\langle\mathbf{u},\mathbf{z}\rangle}{\left\|\mathbf{z}\right\|_{2}}\geq 0,\quad\mathbf{z}\to\mathbf{0}.$

$\displaystyle\liminf_{\substack{z_{k}=y_{k},\,k\neq j\\ z_{j}\to y_{j},\,\mathbf{z}\neq\mathbf{y}}}\frac{\left\|\mathbf{z}\right\|_{1}^{q}-\left\|\mathbf{y}\right\|_{1}^{q}-\langle\mathbf{u},\mathbf{z}-\mathbf{y}\rangle}{\left\|\mathbf{z}-\mathbf{y}\right\|_{2}}\geq 0.$

$\begin{cases}(\mathbf{u})_{j}=\phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\cdot\operatorname{sgn}(y_{j}),&j\in\operatorname{supp}(\mathbf{y}),\\ |(\mathbf{u})_{j}|\leq\phi^{\prime}(\left\|\mathbf{y}\right\|_{1}),&j\notin\operatorname{supp}(\mathbf{y}),\end{cases}$

$\displaystyle h(\mathbf{z})=\Big(\sum_{j\in\operatorname{supp}(\mathbf{y})}|z_{j}|+\sum_{j\notin\operatorname{supp}(\mathbf{y})}k_{j}z_{j}\Big)^{q},$

$(\nabla h(\mathbf{y}))_{j}=\begin{cases}\phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\cdot\operatorname{sgn}(y_{j}),&j\in\operatorname{supp}(\mathbf{y}),\\ \phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\cdot k_{j},&j\notin\operatorname{supp}(\mathbf{y}),\end{cases}$

$\begin{cases}(\mathbf{u}^{(k)})_{j}=\phi^{\prime}(\|\mathbf{z}^{(k)}\|_{1})\cdot\operatorname{sgn}(z_{j}^{(k)}),&j\in\operatorname{supp}(\mathbf{z}^{(k)}),\\ |(\mathbf{u}^{(k)})_{j}|\leq\phi^{\prime}(\|\mathbf{z}^{(k)}\|_{1}),&j\notin\operatorname{supp}(\mathbf{z}^{(k)}),\end{cases}$

$\widehat{\partial}(\phi\circ g)(\mathbf{y})=\partial(\phi\circ g)(\mathbf{y}),\quad\partial^{\infty}(\phi\circ g)(\mathbf{y})=(\widehat{\partial}(\phi\circ g)(\mathbf{y}))^{\infty}.$

$\widehat{\partial}(\phi\circ g)(\mathbf{y})=(-\infty,\infty)^{m},$

$\partial^{\infty}(\phi\circ g)(\mathbf{y})=(\widehat{\partial}(\phi\circ g)(\mathbf{y}))^{\infty}=\{\mathbf{0}\},$

$\displaystyle\left\|\mathbf{x}\right\|_{p,q}^{q}=\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})$

$\displaystyle\mathcal{E}(\mathbf{x})=\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})+F_{r}(\mathbf{x}),\quad p\geq 1,\ 1\leq r\leq\infty,$

$\displaystyle\partial\mathcal{E}(\mathbf{x})=\partial\Big(\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})\Big)+\partial F_{r}(\mathbf{x}).$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Advanced Optimization Algorithms Research · Optimization and Variational Analysis

Full text

An Efficient and Globally Convergent Algorithm for $\ell_{p,q}$ - $\ell_{r}$ Model in Group Sparse Optimization

Yunhua Xue, Yanfei Feng and Chunlin Wu

School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China Corresponding author. [email protected]

Abstract

Group sparsity combines the underlying sparsity and the group structure of the data. We develop a proximally linearized algorithm, InISSAPL, for the non-Lipschitz group sparse $\ell_{p,q}$-$\ell_{r}$ optimization problem. The algorithm gives a unified framework for all parameters $p\geq 1$, $0<q<1$, $1\leq r\leq\infty$, and is therefore applicable to different kinds of measurement noise. In particular, it covers as special cases the sum of the non-smooth $\ell_{1,q}$ regularization term and a non-smooth $\ell_{1}$/$\ell_{\infty}$ fidelity term. It allows an inexact inner loop, which can be implemented by scaled ADMM, and still converges globally. The algorithm is efficient because computation is carried out only on the shrinking group support set. Numerical experiments are presented for a variety of parameters $p,q,r$. The comparisons show that our algorithm outperforms existing methods.

Keywords. group sparse, $\ell_{p,q}$ - $\ell_{r}$ model, non-Lipschitz optimization, Laplace noise, Gaussian noise, uniform distribution noise, lower bound theory, Kurdyka-Łojasiewicz property

Mathematics subject classification (2010). 49M05, 65K10, 90C26, 90C30

1 Introduction

We consider the following $\ell_{p,q}$ - $\ell_{r}$ minimization problem

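$\displaystyle\min_{\mathbf{x}\in\mathbb{R}^{N}}\mathcal{E}(\mathbf{x}):=\left\|\mathbf{x}\right\|_{p,q}^{q}+F_{r}(\mathbf{x}),$

(1.1)

where

$F_{r}(\mathbf{x})=\begin{cases}\displaystyle\frac{1}{r\alpha}\left\|\mathbf{A}\mathbf{x}-\mathbf{y}\right\|_{r}^{r},&r\geq 1,\\ \displaystyle\frac{1}{\alpha}\left\|\mathbf{A}\mathbf{x}-\mathbf{y}\right\|_{\infty},&r=\infty,\end{cases}$

and $p\in[1,\infty)$, $q\in(0,1)$, $r\in[1,\infty]$, $\alpha\in(0,\infty)$, $\mathbf{A}\in\mathbb{R}^{M\times N}$, $\mathbf{x}\in\mathbb{R}^{N}$, $\mathbf{y}\in\mathbb{R}^{M}$. The $\ell_{p,q}$ regularization term is a quasi-norm that measures the group sparse structure of $\mathbf{x}$; it is defined by

$\displaystyle\left\|\mathbf{x}\right\|_{p,q}=\Big(\sum_{\mathsf{i}=\sf 1}^{\mathsf{g}}\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p}^{q}\Big)^{1/q},$

where $\mathbf{x}_{\mathsf{i}},\mathsf{i}=\sf 1,\cdots,\mathsf{g}$ are the group members defined in Section 2 and $\|\cdot\|_{p}$ is the standard $\ell_{p}$ norm for vectors.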
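As a rough illustration of the regularization term, the following sketch evaluates $\|\mathbf{x}\|_{p,q}^{q}=\sum_{\mathsf{i}}\|\mathbf{x}_{\mathsf{i}}\|_{p}^{q}$ with NumPy. The function name `group_norm_pq` and the representation of groups as a list of index arrays are assumptions made for this example, not part of the paper.

```python
import numpy as np

def group_norm_pq(x, groups, p=2.0, q=0.5):
    """Return ||x||_{p,q}^q = sum_i ||x_i||_p^q.

    `groups` is a list of integer index arrays, one per group
    (a hypothetical encoding of the group structure).
    """
    return sum(np.linalg.norm(x[idx], ord=p) ** q for idx in groups)

# Three groups of size two; the second group is entirely zero.
x = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
value = group_norm_pq(x, groups, p=2.0, q=0.5)  # sqrt(5) + 0 + 1
```

Zero groups contribute nothing to the sum, which is why the term rewards solutions with few nonzero groups.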

In the era of big data, data describing structures, segments and features often have a group property; namely, their components are naturally grouped. Sparsity allows us to reconstruct high-dimensional data from only a small number of variables, leading to better recovery performance. Combining the two, the recovery or reconstruction of group sparse data has become an active research topic in sparse optimization. The group sparse minimization problem (1.1) with underdetermined linear measurements has a wide variety of applications, such as signal recovery [17, 21], image processing [31], compressed sensing [30], model selection in birth weight prediction [38], sparse learning [35], and variable selection in gene finding [28]. It is therefore meaningful to study efficient algorithms for this general group sparse optimization problem.

Here, general means that the model covers many special cases for different parameters $p,q,r$. We assume the observation model

$\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{n},$

where $\mathbf{n}\in\mathbb{R}^{M}$ represents the noise. The model adapts to different kinds of noise through the parameter $r$ in the data fitting term $F_{r}(\mathbf{x})$. As is well known, the $\ell_{2}$ fidelity term ($r=2$) is used for Gaussian noise. For Laplace noise or heavy-tailed noise such as impulsive noise, the $\ell_{1}$ fidelity term ($r=1$) is a good choice. For uniformly distributed noise or quantization error, the $\ell_{\infty}$ fidelity term ($r=\infty$) is suitable.
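The dependence of the fidelity term on $r$ can be sketched as follows. This is a minimal illustration of $F_{r}$ as defined in the introduction; the function name `fidelity` and the toy data are assumptions for the example.

```python
import numpy as np

def fidelity(A, x, y, r=2.0, alpha=1.0):
    """Evaluate F_r(x): (1/(r*alpha)) * ||Ax - y||_r^r for finite r >= 1,
    and (1/alpha) * ||Ax - y||_inf for r = inf."""
    res = A @ x - y
    if np.isinf(r):
        return np.max(np.abs(res)) / alpha
    return np.sum(np.abs(res) ** r) / (r * alpha)

A = np.eye(2)
x = np.array([1.0, 2.0])
y = np.zeros(2)
f1 = fidelity(A, x, y, r=1.0)        # Laplace / impulsive noise
f2 = fidelity(A, x, y, r=2.0)        # Gaussian noise
finf = fidelity(A, x, y, r=np.inf)   # uniform noise / quantization error
```

With the residual $(1,2)$ the three choices weight the large component differently: $r=1$ sums magnitudes, $r=2$ squares them, and $r=\infty$ looks only at the largest one.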

Many works study the sparse optimization problem without group structure, i.e., the non-group model in which the number of groups $\mathsf{g}$ equals $N$. In that case the $\ell_{p,q}$ term in (1.1) degenerates to the $\ell_{q}$ ($0<q<1$) regularization term. One class of methods consists of smoothing approximation methods [5, 14, 13, 24, 12], which remove the non-Lipschitz property of the objective function by means of a smoothing function $\varphi(x,\theta)$. The second class consists of general iterative shrinkage-thresholding algorithms (GISA) for the $\ell_{q}$-$\ell_{2}$ problem [34, 40, 9]. GISA was inspired by the great success of soft thresholding and iterative shrinkage-thresholding algorithms (ISTA) [16, 3] for the convex $\ell_{1}$-$\ell_{2}$ problem. The third class consists of iterative reweighted minimization methods for the $\ell_{q}$-$\ell_{2}$ minimization problem; see, e.g., [23, 10, 15, 27]. Reweighted methods reformulate the original non-Lipschitz $\ell_{q}$-$\ell_{2}$ problem as Lipschitz ones by means of a de-singularizing parameter. Very recently, [25, 39] developed methods that successively shrink the support of the variables to overcome the non-Lipschitz property; [25] considered the non-group case with $r\neq\infty$, and [39] focused on image restoration with $r=2$. To the best of our knowledge, most of these references considered only $r=2$.

For the group sparse optimization problem (1.1), most algorithms have likewise been proposed only for the case $r=2$. Hu et al. [21] investigated this problem via $\ell_{p,q}$ regularization, and others developed algorithms for $\ell_{2,1}$ regularized least squares, e.g., the group Lasso [38, 8, 17]. As noted before, it is important to develop an algorithm for general $1\leq r\leq\infty$. The difficulty is to handle the noise parameter $r$ together with the regularization parameters $p,q$ of the group structure in a unified way. In addition, the regularization term $\ell_{p,q}$ with $p\geq 1$, $0<q<1$ in the objective function $\mathcal{E}$ in (1.1) leads to a non-convex, non-Lipschitz optimization problem. The non-smoothness becomes even more severe in the $\ell_{1,q}$ regularization case. All these characteristics make the minimization model (1.1) very challenging to solve.

In this paper, we extend our recent work [25] to solve the general group sparse optimization problem (1.1). This extension is not trivial, because model (1.1) is more complicated and includes more nonsmooth cases than the non-group one in [25], as mentioned in the previous paragraph. We first establish a motivating proposition by developing subdifferential lemmas for group variables. This justifies the design of a unified iterative support shrinking algorithm over the group support set of the unknown variables for various $p,q,r$. To make the algorithm more practical and easier to implement, we linearize the regularization term and present the InISSAPL algorithm to compute an approximate solution. Although the algorithm allows an inexact inner loop, we prove its global convergence using a new lower bound theory for the $\ell_{p}$ norm of the nonzero groups of the iteration sequence. We also discuss the implementation of the algorithm by scaled ADMM; in particular, for the case $r=\infty$, we derive the explicit solution of the corresponding subproblem analytically. Numerical experiments show that the algorithm is not only robust to different kinds of noise, but also performs well for different $p,q$. Compared with other group sparse optimization methods in terms of relative error, success rate and running time, our algorithm performs better. The main features of the InISSAPL algorithm for model (1.1) are as follows.

(i) The algorithm provides a unified framework for all the parameters $p,q,r$. In particular, it can deal with the sum of the non-smooth $\ell_{1,q}$ regularization term and a non-smooth $\ell_{1}$/$\ell_{\infty}$ fidelity term.

(ii) The computation is carried out only on the shrinking group support set of $\mathbf{x}$ at each iteration step. The algorithm is therefore efficient, especially for large scale sparse recovery problems.

(iii) The key step in using the KL property to prove the global convergence of the algorithm is to overcome the non-Lipschitz property of the objective function and to construct an appropriate element of the subdifferential. This is achieved by developing a lower bound theory for the nonzero groups of the iterative sequence and by a technical construction of the subdifferential; see section 4 for details.

The rest of the paper is organized as follows. In section 2, we give some basic notation and preliminaries. In section 3, we give the motivating proposition and propose the corresponding algorithms. In section 4, we establish the global convergence theorem for the proposed algorithms. In section 5, we describe the implementation of the algorithm by scaled ADMM. Numerical experiments and comparisons are presented in section 6. Section 7 concludes the paper.

2 Notations and preliminaries

Suppose that $\mathbf{A}$ is an $M\times N$ matrix and $\mathbf{x}$ is a column vector with $N$ components. $I=\left\{1,2,\dots,M\right\}$ denotes the row index set of $\mathbf{A}$. To distinguish group indices from ordinary ones, we write them in an upright sans-serif font, such as $\mathsf{G},\mathsf{i},\mathsf{g}$. Let $\mathbf{x}:=\left(\mathbf{x}_{\sf 1}^{T},\mathbf{x}_{\sf 2}^{T},\cdots,\mathbf{x}_{\mathsf{g}}^{T}\right)^{T}$ represent the group structure of $\mathbf{x}$, and let $\mathsf{G}=\left\{\sf 1,\sf 2,\dots,\mathsf{g}\right\}$ denote the group index set of $\mathbf{x}$. For each group member $\mathbf{x}_{\mathsf{i}}$, we denote its index set by $J_{\mathsf{i}}=\{1,2,\cdots,N_{\mathsf{i}}\}$, so that $N=N_{\sf 1}+\cdots+N_{\mathsf{g}}$. We refer to $\mathbf{x}_{\mathsf{i},j}$ as the $j$th entry of $\mathbf{x}_{\mathsf{i}}$ and denote the group support set of $\mathbf{x}$ by

$\operatorname{supp}_{\mathsf{G}}(\mathbf{x}):=\{\mathsf{i}\in\mathsf{G}:\mathbf{x}_{\mathsf{i}}\neq\mathbf{0}\},$

where $\mathbf{x}_{\mathsf{i}}\neq\bf 0$ means that $\mathbf{x}_{\mathsf{i},j}\neq 0$ for some $j\in J_{\mathsf{i}}$. Furthermore, we write $\mathbf{x}_{\mathsf{i}}=\bf 0$ when $\mathbf{x}_{\mathsf{i},j}=0$ for all $j\in J_{\mathsf{i}}$. The support of the group member $\mathbf{x}_{\mathsf{i}}$ is defined by

$\operatorname{supp}(\mathbf{x}_{\mathsf{i}})=\{j\in J_{\mathsf{i}}:\mathbf{x}_{\mathsf{i},j}\neq 0\}.$

Let $\mathsf{S}$ be a subset of $\mathsf{G}$. We denote by $\mathbf{x}_{\mathsf{S}}$ the group vectors of $\mathbf{x}$ indexed by $\mathsf{S}$, which consist of the nonzero group members of $\mathbf{x}$ when $\mathsf{S}=\operatorname{supp}_{\mathsf{G}}(\mathbf{x})$.
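The two support sets can be computed directly from their definitions. The sketch below is illustrative only: the helper names and the list-of-index-arrays encoding of the groups are assumptions, and indices are 0-based, unlike the 1-based indices in the text.

```python
import numpy as np

def group_support(x, groups):
    """Indices i of the groups with x_i != 0 (at least one nonzero entry)."""
    return [i for i, idx in enumerate(groups) if np.any(x[idx] != 0)]

def member_support(x, idx):
    """Positions j inside one group member x_i with x_{i,j} != 0."""
    return [j for j, v in enumerate(x[idx]) if v != 0]

x = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
S = group_support(x, groups)        # groups 0 and 2 are nonzero
J = member_support(x, groups[2])    # only the first entry of group 2 is nonzero
```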

For a matrix $\mathbf{A}\in\mathbb{R}^{M\times N}$, we partition it into submatrices $A_{k,\mathsf{i}}$, $k\in I$, $\mathsf{i}\in\mathsf{G}$, obtained by partitioning the $k$th row of $\mathbf{A}$ according to the group structure of $\mathbf{x}$, i.e.,

$\mathbf{A}=\left[\begin{array}{cccc}A_{1,\sf 1}&A_{1,\sf 2}&\cdots&A_{1,\mathsf{g}}\\ \cdots&\cdots&\cdots&\cdots\\ A_{M,\sf 1}&A_{M,\sf 2}&\cdots&A_{M,\mathsf{g}}\\ \end{array}\right].$

Because $A_{k,\mathsf{i}}$, $k\in I$, $\mathsf{i}\in\mathsf{G}$, are row vectors, we denote by $(A_{k,\mathsf{i}})_{j}$ the $j$th entry of $A_{k,\mathsf{i}}$. In analogy with $\mathbf{x}_{\mathsf{S}}$, we denote by $\mathbf{A}_{\mathsf{S}}$ the column sub-matrix of $\mathbf{A}$ consisting of the columns indexed by $\mathsf{S}$.

Define $\phi:[0,\infty)\to[0,\infty)$ by $\phi(x)=x^{q}(0<q<1)$ . We state some useful properties for $\phi(\cdot)$ .

Proposition 2.1.

The function $\phi(\cdot)$ has the following properties:

(i) $\phi(0)=0$ and $\phi^{\prime}(x)=qx^{q-1}>0$ on $(0,\infty)$.

(ii) $\phi(x)$ is concave and the following inequality holds,

$\phi(y)\leq\phi(x)+\phi^{\prime}(x)(y-x),\quad\forall x\in(0,\infty),\ y\in[0,\infty).$

(iii) For any $c>0$, $\phi^{\prime}(x)$ is $L_{c}$-Lipschitz continuous on $[c,\infty)$, i.e., there exists a constant $L_{c}>0$ determined by $c$ such that, for all $x,y\in[c,\infty)$,

$|\phi^{\prime}(x)-\phi^{\prime}(y)|\leq L_{c}|x-y|.$
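One admissible value of the constant in (iii) can be worked out explicitly. The short derivation below is offered as an illustration, not as the authors' own argument; it only uses $\phi(x)=x^{q}$ with $0<q<1$ and the mean value theorem.

```latex
% phi''(t) = q(q-1) t^{q-2}, and t^{q-2} is decreasing because q-2 < 0.
\begin{aligned}
|\phi'(x)-\phi'(y)|
  &\le \sup_{t\ge c}|\phi''(t)|\,|x-y|
   = \sup_{t\ge c} q(1-q)\,t^{q-2}\,|x-y| \\
  &= q(1-q)\,c^{q-2}\,|x-y|, \qquad x,y\in[c,\infty),
\end{aligned}
% so one may take L_c = q(1-q) c^{q-2}.
```

The bound blows up as $c\to 0$, which reflects the non-Lipschitz behaviour of $\phi^{\prime}$ at the origin.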

Lemma 2.2.

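Let $\mathbf{y}\in\mathbb{R}^{m}$ be an $m$-dimensional vector. Then the following inequality holds:

$\left\|\mathbf{y}\right\|_{\gamma_{2}}\leq\left\|\mathbf{y}\right\|_{\gamma_{1}},\quad 0<\gamma_{1}\leq\gamma_{2}.$

Proof.

Let $f(t)=\|\mathbf{y}\|_{t}$, $t>0$. Then $f(t)$ is monotonically nonincreasing because $f^{\prime}(t)\leq 0$ for $t>0$ (with equality when $\mathbf{y}$ has at most one nonzero entry).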

∎
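The monotonicity in Lemma 2.2 is easy to check numerically, including for exponents below one where $\|\cdot\|_{t}$ is only a quasi-norm. The sketch below is a sanity check on one example vector chosen for illustration; the helper name `lp` is an assumption.

```python
import numpy as np

def lp(y, t):
    """(sum_j |y_j|^t)^(1/t) for any t > 0 (a quasi-norm when t < 1)."""
    return np.sum(np.abs(y) ** t) ** (1.0 / t)

y = np.array([1.0, -2.0, 0.5])
exponents = [0.5, 1.0, 2.0, 4.0]
values = [lp(y, t) for t in exponents]  # should be nonincreasing in t

# A vector with a single nonzero entry gives a constant value,
# so the inequality cannot be strict in general.
e1 = np.array([0.0, 3.0, 0.0])
constant = [lp(e1, t) for t in exponents]
```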

Lemma 2.3.

Let $s>0$ and $\mathbf{y}\in\mathbb{R}^{m}$. Then there exists a constant $C_{s}>0$ such that

$\left\|\mathbf{y}\right\|_{s}\leq C_{s}\left\|\mathbf{y}\right\|_{s+1}.$

Proof.

For $s\geq 1$, the result follows easily from the equivalence of norms in finite-dimensional spaces. For $0<s<1$, from [21, Lemma 1], we have

$\left\|\mathbf{y}\right\|_{s}\leq m^{1-2^{-Z}}\left\|\mathbf{y}\right\|_{2},$

where $Z$ is the smallest integer such that $2^{Z-1}s\geq 1$. Using the norm equivalence once again, we obtain

$\left\|\mathbf{y}\right\|_{s}\leq C_{s}\left\|\mathbf{y}\right\|_{s+1},$

where $C_{s}=m^{1-2^{-Z}}\cdot C$ and $C$ is the norm equivalence constant. ∎

3 Motivation and the proposed algorithm

3.1 Subdifferentials and regularity

By the definition of $\phi(\cdot)$, we have $\left\|\mathbf{x}\right\|_{p,q}^{q}=\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})$. We also define the norm function $g(\mathbf{y})=\left\|\mathbf{y}\right\|_{p}$ for a vector $\mathbf{y}$. In order to calculate the subdifferential of the objective function $\mathcal{E}(\mathbf{x})$ in (1.1), we first give two lemmas.

Lemma 3.1 (Subdifferential).

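Let $\mathbf{y}\in\mathbb{R}^{m}$ be an $m$-dimensional vector. Then the following results hold.

(i) For $\mathbf{y}=\bf 0$ and $p\geq 1$, the subdifferential is

$\displaystyle\partial(\phi\circ g)(\mathbf{y})=\prod_{j=1}^{m}S_{j},$

where $S_{j}=(-\infty,\infty)$ for all $j=1,2,\cdots,m$, and $\prod$ denotes the Cartesian product of sets.

(ii) For $\mathbf{y}\neq\bf 0$, the subdifferential is

$\displaystyle\partial(\phi\circ g)(\mathbf{y})=\prod_{j=1}^{m}S_{j},$

where

$S_{j}=\begin{cases}\phi^{\prime}(\left\|\mathbf{y}\right\|_{p})\left\|\mathbf{y}\right\|_{p}^{1-p}|y_{j}|^{p-1}\operatorname{sgn}(y_{j}),&p>1,\\ \phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\operatorname{sgn}(y_{j}),&j\in\operatorname{supp}(\mathbf{y})\mbox{ and }p=1,\\ \left[-\phi^{\prime}(\left\|\mathbf{y}\right\|_{1}),\phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\right],&j\notin\operatorname{supp}(\mathbf{y})\mbox{ and }p=1.\end{cases}$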

Proof.

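For brevity, denote the set $\prod_{j=1}^{m}S_{j}$ by $S$. In (i), by definition, $\mathbf{u}$ belongs to the regular subdifferential $\widehat{\partial}(\phi\circ g)(\mathbf{y})$ at $\mathbf{y}=\bf 0$ if and only if

$\displaystyle\liminf_{\mathbf{z}\to\mathbf{0},\,\mathbf{z}\neq\mathbf{0}}\frac{\left\|\mathbf{z}\right\|_{p}^{q}-\langle\mathbf{u},\mathbf{z}-\mathbf{0}\rangle}{\left\|\mathbf{z}-\mathbf{0}\right\|_{2}}\geq 0.$

From the equivalence of norms when $p\geq 1$, we have

$\left\|\mathbf{z}\right\|_{p}\geq C\left\|\mathbf{z}\right\|_{2},$

where $C>0$ is a constant. It is therefore sufficient to have

$\displaystyle\frac{\left\|\mathbf{z}\right\|_{p}^{q}-\langle\mathbf{u},\mathbf{z}-\mathbf{0}\rangle}{\left\|\mathbf{z}-\mathbf{0}\right\|_{2}}\geq\frac{C^{q}\left\|\mathbf{z}\right\|_{2}^{q}-\langle\mathbf{u},\mathbf{z}\rangle}{\left\|\mathbf{z}\right\|_{2}}\geq 0,\quad\mathbf{z}\to\mathbf{0}.$

This is true for any $\mathbf{u}\in S$ because $0<q<1$. The proof is then completed by the fact that $\widehat{\partial}(\phi\circ g)(\mathbf{y})\subseteq\partial(\phi\circ g)(\mathbf{y})$.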

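In (ii), for $p>1$, the function $(\phi\circ g)(\mathbf{y})$ is continuously differentiable at $\mathbf{y}$, so the subdifferential reduces to the gradient in this case. For $p=1$, we first show that $S=\widehat{\partial}(\phi\circ g)(\mathbf{y})$. On the one hand, let $\mathbf{u}\in\widehat{\partial}(\phi\circ g)(\mathbf{y})$ with $\mathbf{y}\neq\bf 0$. The limit inferior condition holds in particular along each coordinate direction,

$\displaystyle\liminf_{\substack{z_{k}=y_{k},\,k\neq j\\ z_{j}\to y_{j},\,\mathbf{z}\neq\mathbf{y}}}\frac{\left\|\mathbf{z}\right\|_{1}^{q}-\left\|\mathbf{y}\right\|_{1}^{q}-\langle\mathbf{u},\mathbf{z}-\mathbf{y}\rangle}{\left\|\mathbf{z}-\mathbf{y}\right\|_{2}}\geq 0.$

Then we have

$\begin{cases}(\mathbf{u})_{j}=\phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\cdot\operatorname{sgn}(y_{j}),&j\in\operatorname{supp}(\mathbf{y}),\\ |(\mathbf{u})_{j}|\leq\phi^{\prime}(\left\|\mathbf{y}\right\|_{1}),&j\notin\operatorname{supp}(\mathbf{y}),\end{cases}$

by the mean value theorem. So $\widehat{\partial}(\phi\circ g)(\mathbf{y})\subseteq S$.

On the other hand, for $\mathbf{z}$ in a neighbourhood of $\mathbf{y}$, we construct the function

$\displaystyle h(\mathbf{z})=\Big(\sum_{j\in\operatorname{supp}(\mathbf{y})}|z_{j}|+\sum_{j\notin\operatorname{supp}(\mathbf{y})}k_{j}z_{j}\Big)^{q},$

where $k_{j}\in[-1,1]$. Then $h$ is differentiable at $\mathbf{y}$ and $h(\mathbf{z})\leq\phi(\left\|\mathbf{z}\right\|_{1})$, $h(\mathbf{y})=\phi(\left\|\mathbf{y}\right\|_{1})$. From [29, Proposition 8.5], we have $\nabla h(\mathbf{y})\in\widehat{\partial}(\phi\circ g)(\mathbf{y})$. Here

$(\nabla h(\mathbf{y}))_{j}=\begin{cases}\phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\cdot\operatorname{sgn}(y_{j}),&j\in\operatorname{supp}(\mathbf{y}),\\ \phi^{\prime}(\left\|\mathbf{y}\right\|_{1})\cdot k_{j},&j\notin\operatorname{supp}(\mathbf{y}),\end{cases}$

so $S\subseteq\widehat{\partial}(\phi\circ g)(\mathbf{y})$ follows because $k_{j}\in[-1,1]$, $j\notin\operatorname{supp}(\mathbf{y})$, is arbitrary. Hence $S=\widehat{\partial}(\phi\circ g)(\mathbf{y})$.

It remains to show $\partial(\phi\circ g)(\mathbf{y})\subseteq\widehat{\partial}(\phi\circ g)(\mathbf{y})$, since the inclusion in the other direction holds by the remark to Definition 9.1.

In fact, suppose $\mathbf{u}\in\partial(\phi\circ g)(\mathbf{y})$. By definition, there exist $\mathbf{z}^{(k)}\to\mathbf{y}$ with $\phi(\left\|\mathbf{z}^{(k)}\right\|_{1})\to\phi(\left\|\mathbf{y}\right\|_{1})$ and $\mathbf{u}^{(k)}\in\widehat{\partial}(\phi\circ g)(\mathbf{z}^{(k)})$ with $\mathbf{u}^{(k)}\to\mathbf{u}$. Since $\mathbf{z}^{(k)}\to\mathbf{y}$, we have $\operatorname{supp}(\mathbf{y})\subseteq\operatorname{supp}(\mathbf{z}^{(k)})$ when $k$ is sufficiently large. Based on this and the fact

$\begin{cases}(\mathbf{u}^{(k)})_{j}=\phi^{\prime}(\|\mathbf{z}^{(k)}\|_{1})\cdot\operatorname{sgn}(z_{j}^{(k)}),&j\in\operatorname{supp}(\mathbf{z}^{(k)}),\\ |(\mathbf{u}^{(k)})_{j}|\leq\phi^{\prime}(\|\mathbf{z}^{(k)}\|_{1}),&j\notin\operatorname{supp}(\mathbf{z}^{(k)}),\end{cases}$

which gives $|(\mathbf{u}^{(k)})_{j}|\leq\phi^{\prime}(\|\mathbf{z}^{(k)}\|_{1})$ for every $j$, we obtain $\mathbf{u}\in\widehat{\partial}(\phi\circ g)(\mathbf{y})$ by passing to the limit. ∎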
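For $p>1$ and $\mathbf{y}\neq\bf 0$, the formula in Lemma 3.1 (ii) is an ordinary gradient, so it can be compared with finite differences. The following sketch is a numerical sanity check under that assumption; the function names, the test vector and the step size are choices made for this example.

```python
import numpy as np

def phi_norm(y, p, q):
    """(phi o g)(y) = ||y||_p^q."""
    return np.linalg.norm(y, ord=p) ** q

def grad_phi_norm(y, p, q):
    """Entries phi'(||y||_p) * ||y||_p^(1-p) * |y_j|^(p-1) * sgn(y_j), for p > 1."""
    n = np.linalg.norm(y, ord=p)
    phi_prime = q * n ** (q - 1.0)
    return phi_prime * n ** (1.0 - p) * np.abs(y) ** (p - 1.0) * np.sign(y)

y = np.array([1.0, -2.0, 0.5])
p, q, h = 3.0, 0.5, 1e-6

# Central finite differences along each coordinate direction.
numeric = np.array([
    (phi_norm(y + h * e, p, q) - phi_norm(y - h * e, p, q)) / (2.0 * h)
    for e in np.eye(y.size)
])
max_gap = np.max(np.abs(numeric - grad_phi_norm(y, p, q)))
```

The case $p=1$ is not covered by such a check, because there the subdifferential is set-valued on the zero entries of $\mathbf{y}$.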

The regularity of a function is essential for handling the subdifferential of the sum of two non-smooth terms, i.e., the $\ell_{1,q}$ term and the $\ell_{1}$/$\ell_{\infty}$ fidelity term, so we state the following lemma.

Lemma 3.2 (Regularity).

Let $\mathbf{y}\in\mathbb{R}^{m}$ be the $m$ -dimensional vector, then $(\phi\circ g)(\mathbf{y})$ is regular at $\mathbf{y}$ for $p\geq 1$ .

Proof.

By [29, Corollary 8.11], $(\phi\circ g)(\mathbf{y})$ is regular at $\mathbf{y}$ if and only if

$\widehat{\partial}(\phi\circ g)(\mathbf{y})=\partial(\phi\circ g)(\mathbf{y}),\quad\partial^{\infty}(\phi\circ g)(\mathbf{y})=(\widehat{\partial}(\phi\circ g)(\mathbf{y}))^{\infty}.$

(3.1)

From the proof of Lemma 3.1, we know that the first equality in (3.1) holds. It remains to verify the second equality.

For $\mathbf{y}=\bf 0$, we have

$\widehat{\partial}(\phi\circ g)(\mathbf{y})=(-\infty,\infty)^{m},$

thus the horizon cone $(\widehat{\partial}(\phi\circ g)(\mathbf{y}))^{\infty}$ is the same set $(-\infty,\infty)^{m}$, by letting $\mathbf{v}^{(k)}=k\mathbf{v}$ and $\lambda^{(k)}=1/k$ in Definition 9.2. The same argument shows that the horizon subdifferential is $\partial^{\infty}(\phi\circ g)(\mathbf{y})=(-\infty,\infty)^{m}$.

For $\mathbf{y}\neq\bf 0$, we have the following from Definition 9.1 and the remark to Definition 9.2:

$\partial^{\infty}(\phi\circ g)(\mathbf{y})=(\widehat{\partial}(\phi\circ g)(\mathbf{y}))^{\infty}=\{\mathbf{0}\},$

due to the boundedness of $\widehat{\partial}(\phi\circ g)(\mathbf{y})$. ∎

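Remark.

From [29, Proposition 10.5] for separable functions, the sum function

$\displaystyle\left\|\mathbf{x}\right\|_{p,q}^{q}=\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})$

is also regular.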

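The objective function $\mathcal{E}$ in (1.1) reads

$\displaystyle\mathcal{E}(\mathbf{x})=\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})+F_{r}(\mathbf{x}),\quad p\geq 1,\ 1\leq r\leq\infty,$

which is bounded below, coercive, and continuous. Hence it has at least one minimizer.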

Now, we derive the subdifferential of $\mathcal{E}$ at $\mathbf{x}$ . From Lemma 3.2 and the remark, we know that $\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})$ is regular. For $1\leq r\leq\infty$ , $F_{r}(\mathbf{x})$ is convex and also regular. By [29, Exercise 10.9], we get

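$\displaystyle\partial\mathcal{E}(\mathbf{x})=\partial\Big(\sum_{\mathsf{i}\in\mathsf{G}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})\Big)+\partial F_{r}(\mathbf{x}).$

(3.3)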

The subdifferential on the first term in (3.3) can be obtained by [29, Proposition 10.5],

[TABLE]

The subdifferential factors in the right-hand term can be calculated by Lemma 3.1 according to the specific cases of $\mathbf{x}_{\mathsf{i}}$ . The subdifferential on the second term in (3.3) can be obtained by the chain rule of composite subdifferential,

[TABLE]

where the subdifferential of the infinity norm can be derived as follows. From the Danskin-Bertsekas Theorem for subdifferential in [4, Proposition A.22], it holds that

[TABLE]

Hence each entry of an element of $\partial F_{r}(\mathbf{x})$, denoted by $\boldsymbol{\eta}_{\mathsf{i},j}(\mathbf{x})$, $\mathsf{i}\in\mathsf{G}$, $j\in J_{\mathsf{i}}$, has the following representation,

[TABLE]

From the definition of the subdifferential, we have that $\mathbf{x}^{\ast}$ is a stationary point of (1.1) if and only if

[TABLE]

3.2 A motivating proposition

The following proposition inspires us to design the algorithm in the next section.

Proposition 3.3.

Suppose $\mathbf{x}\in\mathbb{R}^{N}$ has the group structure $\mathbf{x}:=\left(\mathbf{x}_{\sf 1}^{T},\mathbf{x}_{\sf 2}^{T},\cdots,\mathbf{x}_{\mathsf{g}}^{T}\right)^{T}$. If $\mathbf{x}$ is sufficiently close to a local minimizer (or a stationary point) $\mathbf{x}^{\ast}$ of (1.1), then it holds that

[TABLE]

Proof.

We prove (3.8) by contradiction.

As $\mathbf{x}^{\ast}$ is a local minimizer (or a stationary point) of $\mathcal{E}$, the condition (3.7) implies that $\mathbf{0}\in\partial\mathcal{E}(\mathbf{x}^{\ast})$.

Suppose that $\mathbf{x}_{\mathsf{i}^{\prime}}^{\ast}\neq\bf 0$ for some $\mathsf{i}^{\prime}\in\mathsf{G}\setminus\operatorname{supp}_{\mathsf{G}}(\mathbf{x})$, that is, $\mathbf{x}^{\ast}_{\mathsf{i}^{\prime},j}\neq 0$ for some $j\in J_{\mathsf{i}^{\prime}}$. For $1<r<\infty$, we have

[TABLE]

Summing up all the absolute values of the two terms in (3.9) for $j\in\operatorname{supp}(\mathbf{x}^{\ast}_{\mathsf{i}^{\prime}})$ , we have

[TABLE]

where the left inequality follows from Lemma 2.2 for $p>1$ and from $\left\|\mathbf{x}^{\ast}_{\mathsf{i}^{\prime}}\right\|^{p-1}_{p-1}=\#\{\mbox{nonzero entries of }\mathbf{x}^{\ast}_{\mathsf{i}^{\prime}}\}$ for $p=1$.

The right side of (3.10) is uniformly bounded in a neighborhood of $\mathbf{x}$, and the bound is independent of $\mathbf{x}^{\ast}$. Since $\mathbf{x}_{\mathsf{i}^{\prime}}^{\ast}$ can be arbitrarily close to $\mathbf{x}_{\mathsf{i}^{\prime}}=\bf 0$, $\mathsf{i}^{\prime}\in\mathsf{G}\setminus\operatorname{supp}_{\mathsf{G}}(\mathbf{x})$, this contradicts (3.10) because $0<q<1$.

For $r=1$ and $r=\infty$ , we have

[TABLE]

Thus, the result can be derived similarly from the uniform boundedness of the sets $\boldsymbol{\eta}_{\mathsf{i}^{\prime},j}(\mathbf{x}^{\ast})$ in a neighborhood of $\mathbf{x}$.

∎

Remark.

For the special case $r=2$ in fidelity term, [14, 21] established the lower bound theory, which can also inspire our proposition.

3.3 Algorithm

Motivated by Proposition 3.3, we propose to solve problem (1.1) by an iterative process that generates a sequence whose group support set is nonincreasing. Suppose that $\mathbf{x}^{(l)}$ is the approximate solution at the $l$th iteration. In the next iteration, we minimize the objective function only over the group support set $\mathsf{S}^{(l)}$ of $\mathbf{x}^{(l)}$, with the remaining group components set to zero. This idea yields the following iterative support shrinking algorithm (ISSA).

Initialization: Select $\mathbf{x}^{(0)}\in\mathbb{R}^{N}$ .

Iteration: For $l=0,1,\ldots$ until convergence:

Set $\mathsf{S}^{(l)}=\operatorname{supp}_{\mathsf{G}}(\mathbf{x}^{(l)})$ .

Compute $\mathbf{x}^{(l+1)}_{\mathsf{S}^{(l)}}$ by solving

$\displaystyle\min_{\mathbf{x}_{\mathsf{S}^{(l)}}}\sum_{\mathsf{i}\in\mathsf{S}^{(l)}}\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p})+F_{r}^{(l)}(\mathbf{x}),$

( $\mathcal{P}_{o}$ )

where $F_{r}^{(l)}(\mathbf{x})$ is the fidelity term $F_{r}(\mathbf{x})$ restricted to the group support set $\mathsf{S}^{(l)}$ at the $l$-th step,

$F_{r}^{(l)}(\mathbf{x})=\begin{cases}\displaystyle\frac{1}{r\alpha}\sum_{k\in I}\left|\sum_{\mathsf{i}\in\mathsf{S}^{(l)}}A_{k,\mathsf{i}}\mathbf{x}_{\mathsf{i}}-y_{k}\right|^{r},&r\geq 1,\\ \displaystyle\frac{1}{\alpha}\max_{k\in I}\left|\sum_{\mathsf{i}\in\mathsf{S}^{(l)}}A_{k,\mathsf{i}}\mathbf{x}_{\mathsf{i}}-y_{k}\right|,&r=\infty.\end{cases}$

Set

$\mathbf{x}_{\mathsf{i}}^{(l+1)}=\mathbf{0},\mbox{ for }\mathsf{i}\in\mathsf{G}\setminus{\mathsf{S}}^{(l)}.$
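The bookkeeping behind the support restriction in ISSA can be sketched as follows: columns of $\mathbf{A}$ outside the current group support are dropped, the subproblem is solved in the reduced variables, and the result is scattered back with zeros elsewhere. The helper names and the index-array encoding of the groups are assumptions for this illustration; the subproblem solver itself is not shown.

```python
import numpy as np

def restrict_to_support(A, x, groups):
    """Return the active group indices, their column indices, A_S and x_S."""
    active = [i for i, idx in enumerate(groups) if np.any(x[idx] != 0)]
    if active:
        cols = np.concatenate([groups[i] for i in active])
    else:
        cols = np.array([], dtype=int)
    return active, cols, A[:, cols], x[cols]

def scatter_back(x_S, cols, N):
    """Embed the reduced vector into R^N, with zeros off the support."""
    x = np.zeros(N)
    x[cols] = x_S
    return x

A = np.arange(12.0).reshape(2, 6)
x = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

active, cols, A_S, x_S = restrict_to_support(A, x, groups)
x_full = scatter_back(x_S, cols, x.size)
```

Since the dropped groups are zero, $\mathbf{A}_{\mathsf{S}}\mathbf{x}_{\mathsf{S}}=\mathbf{A}\mathbf{x}$, so the fidelity term can be evaluated on the smaller matrix, which is where the savings for large sparse problems come from.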

To make ISSA more practical, each term $\phi(\left\|\mathbf{x}_{\mathsf{i}}\right\|_{p}),\mathsf{i}\in\mathsf{S}^{(l)}$ can be linearized at $\|\mathbf{x}_{\mathsf{i}}^{(l)}\|_{p}\neq 0$ . We introduce the following energy functional with proximal linearization:

[TABLE]

where $\beta\geq 0$ .

We present an inexact iterative support shrinking algorithm with proximal linearization (InISSAPL) to solve (1.1).

Initialization: Select $\mathbf{x}^{(0)}=c\mathbb{1}$ with $c\neq 0$, or select it randomly, where $\mathbb{1}$ is the all-ones vector.

Iteration: For $l=0,1,\ldots$ until convergence:

Set $\mathsf{S}^{(l)}=\operatorname{supp}_{\mathsf{G}}(\mathbf{x}^{(l)})$ . Set $\beta=0$ for $l=0$ and $\beta>0$ fixed for $l\geq 1$ .

Compute $\mathbf{x}^{(l+1)}_{\mathsf{S}^{(l)}}$ by approximately solving

$\min_{\mathbf{x}_{\mathsf{S}^{(l)}}}\mathcal{E}^{(l)}(\mathbf{x})$

( $\mathcal{P}_{x}$ )

such that

$\mathbf{u}^{(l)}(\mathbf{x}^{(l+1)})\in\partial\mathcal{E}^{(l)}(\mathbf{x}^{(l+1)}),\|\mathbf{u}^{(l)}(\mathbf{x}^{(l+1)})\|_{2}\leq\frac{\beta}{2}\varepsilon\|\mathbf{x}^{(l+1)}-\mathbf{x}^{(l)}\|_{2}.$

(3.13)

with the tolerance error $\varepsilon$ .

Set

$\mathbf{x}_{\mathsf{i}}^{(l+1)}=\mathbf{0},\ \mathbf{u}_{\mathsf{i}}^{(l)}(\mathbf{x}^{(l+1)})=\mathbf{0},\mbox{ for }\mathsf{i}\in\mathsf{G}\setminus{\mathsf{S}}^{(l)}.$
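The outer loop can be illustrated for the special case $p=2$, $r=2$. The sketch below rests on several assumptions that are not taken from the paper: the linearized energy is assumed to be $\sum_{\mathsf{i}\in\mathsf{S}^{(l)}}\phi^{\prime}(\|\mathbf{x}_{\mathsf{i}}^{(l)}\|_{2})\|\mathbf{x}_{\mathsf{i}}\|_{2}+F_{2}(\mathbf{x})+\frac{\beta}{2}\|\mathbf{x}-\mathbf{x}^{(l)}\|_{2}^{2}$, the inner problem is solved by a fixed number of proximal gradient steps instead of scaled ADMM, and the stopping test (3.13) is replaced by that fixed iteration count. All function names and parameter values are hypothetical.

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Prox of tau * ||.||_2: shrink the whole group toward zero."""
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n <= tau else (1.0 - tau / n) * v

def objective(A, x, y, groups, q, alpha):
    """E(x) for p = 2, r = 2."""
    reg = sum(np.linalg.norm(x[g]) ** q for g in groups)
    return reg + np.sum((A @ x - y) ** 2) / (2.0 * alpha)

def inissapl_sketch(A, y, groups, q=0.5, alpha=1e-2, beta=1.0,
                    outer=30, inner=50):
    x = np.ones(A.shape[1])                 # x^(0) = c * 1 with c = 1
    history = []                            # (active group count, E(x^(l)))
    for l in range(outer):
        active = [g for g in groups if np.any(x[g] != 0)]
        history.append((len(active), objective(A, x, y, groups, q, alpha)))
        if not active:
            break
        b = beta if l > 0 else 0.0          # beta = 0 on the first step
        # Linearization weights phi'(||x_i^(l)||_2) on the current support.
        weights = [q * np.linalg.norm(x[g]) ** (q - 1.0) for g in active]
        A_S = A[:, np.concatenate(active)]
        L = np.linalg.norm(A_S, 2) ** 2 / alpha + b   # Lipschitz constant
        x_prev, z = x.copy(), x.copy()
        for _ in range(inner):              # assumed inner solver (not ADMM)
            grad = A.T @ (A @ z - y) / alpha + b * (z - x_prev)
            for g, w in zip(active, weights):
                z[g] = group_soft_threshold(z[g] - grad[g] / L, w / L)
        x = z                               # groups off the support stay zero
    return x, history

rng = np.random.default_rng(0)
M, N, size = 20, 40, 4
groups = [np.arange(i, i + size) for i in range(0, N, size)]
x_true = np.zeros(N)
x_true[groups[1]] = rng.standard_normal(size)
x_true[groups[6]] = rng.standard_normal(size)
A = rng.standard_normal((M, N)) / np.sqrt(M)
y = A @ x_true + 0.01 * rng.standard_normal(M)

x_hat, history = inissapl_sketch(A, y, groups)
counts = [h[0] for h in history]
energies = [h[1] for h in history]
```

Because each inner solve starts from $\mathbf{x}^{(l)}$ and proximal gradient steps do not increase the linearized energy, the concavity inequality of Proposition 2.1 (ii) implies that $\mathcal{E}(\mathbf{x}^{(l)})$ is nonincreasing in this sketch, and the number of active groups can only shrink.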

*Remark**.*

The condition (3.13) in InISSAPL is motivated by [2, 25]. It allows an inexact inner loop and serves as a guide for selecting the approximate solution of ( $\mathcal{P}_{x}$ ). Due to the strong convexity of the problem ( $\mathcal{P}_{x}$ ), it can be solved to any given accuracy. Therefore, the condition (3.13) in InISSAPL holds as long as the problem ( $\mathcal{P}_{x}$ ) is solved sufficiently accurately.

*Remark**.*

From the motivating Proposition 3.3, $\mathbf{x}^{(0)}$ should have as large a support as possible. There are two strategies for choosing the starting point. One is to set $\mathbf{x}^{(0)}$ to a nonzero scalar multiple of the all-one vector, which yields a group lasso problem in the first step when $p=2$ . The other is to generate $\mathbf{x}^{(0)}$ randomly with i.i.d. Gaussian entries (so that a zero group member occurs with probability zero), which yields a weighted group lasso problem when $p=2$ . Since $\mathbf{x}^{(0)}$ is not a proximal solution, we also set $\beta=0$ for the first step of the algorithm. The results of experiments with the two suggested kinds of starting points are given in Section 6.1.
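To see why the constant starting point gives an unweighted group lasso while a random one gives a weighted group lasso, one can compute the linearization weights directly. The sketch below assumes $\phi(t)=t^{q}$ , so that the weight of group $\mathsf{i}$ is $q\|\mathbf{x}_{\mathsf{i}}^{(l)}\|_{p}^{q-1}$ ; the function name and the `groups` list of index arrays are illustrative only.

```python
import numpy as np

def linearization_weights(x, groups, p=2, q=0.5):
    """Weights w_i = phi'(||x_i||_p), assuming phi(t) = t**q,
    i.e. w_i = q * ||x_i||_p**(q - 1). Requires every group to be nonzero."""
    return np.array([q * np.linalg.norm(x[idx], ord=p) ** (q - 1)
                     for idx in groups])
```

With equal group sizes and $\mathbf{x}^{(0)}=c\mathbb{1}$ , all group norms coincide and so do the weights; with a Gaussian start the norms, and hence the weights, generally differ from group to group.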

For the convenience of description later, we give the representation of the subdifferential in (3.13) for $~{}\mathsf{i}\in\mathsf{S}^{(l)},~{}j\in J_{\mathsf{i}}$ ,

[TABLE]

where

[TABLE]

and

[TABLE]

4 Convergence analysis

In this section, we establish the global convergence of the sequence generated by the InISSAPL algorithm. Theorem 9.2 in the appendix gives a celebrated theoretical framework for the convergence of sequences generated by descent methods. It has recently found extensive applications [1, 2, 7], especially in non-convex optimization. Returning to our problem, the key issue is to deal with the non-Lipschitz property of $\mathcal{E}(\mathbf{x})$ . In this paper, a lower bound theory for the iterative sequence is developed to overcome the difficulty caused by the non-Lipschitz property. Furthermore, due to the non-smoothness of $\mathcal{E}(\mathbf{x})$ , constructing the element of $\partial\mathcal{E}(\mathbf{x})$ needed to prove the relative error condition (H2) in Theorem 9.2 is more technical.

From the iteration process, we can see that it produces a nonincreasing sequence of group support sets, as stated in the following lemma.

Lemma 4.1.

The sequence $\left\{\mathsf{S}^{(l)}\right\}$ converges in a finite number of iterations, i.e., there exists an integer $L>0$ such that if $l\geq L$ , then $\mathsf{S}^{(l)}\equiv\mathsf{S}^{(L)}$ .

Proof.

Since $\mathsf{G}$ is a finite set and

$\mathsf{S}^{(l+1)}\subseteq\mathsf{S}^{(l)}\subseteq\mathsf{G}\quad\mbox{for all }l\geq 0,$

$\left\{\mathsf{S}^{(l)}\right\}$ converges in a finite number of iterations. ∎

Next, we verify the conditions (H1)-(H3) in Theorem 9.2 for the sequence of objective function values $\mathcal{E}(\mathbf{x}^{(l)})$ . (H1) is the sufficient decrease condition for the sequence, and it is given in Lemma 4.2. Here we introduce the energy functional with proximal linearization once again, but defined over $\mathbf{x}\in\mathbb{R}^{N}$ :

[TABLE]

Note that it differs from $\mathcal{E}^{(l)}(\mathbf{x})$ in (3.12) in the fidelity term.

Lemma 4.2.

For any $\beta>0$ and $0\leq\varepsilon<1$ , let $\left\{\mathbf{x}^{(l)}\right\}$ be a sequence generated by InISSAPL. Then

(i)

The sequence $\left\{\mathcal{E}(\mathbf{x}^{(l)})\right\}$ is nonincreasing and satisfies

[TABLE]

(ii)

The sequence $\left\{\mathbf{x}^{(l)}\right\}$ is bounded and satisfies $\lim_{l\to\infty}\|\mathbf{x}^{(l+1)}-\mathbf{x}^{(l)}\|_{2}=0$ .

Proof.

Due to the fact that $\phi(0)=0$ , we have

[TABLE]

When $\mathbf{x}\in\mathbb{R}^{N}$ and $\operatorname{supp}_{G}(\mathbf{x})\subseteq\mathsf{S}^{(l)}$ , we obtain

[TABLE]

Let $\widehat{\mathbf{u}}^{(l)}(\mathbf{x})\in\partial\mathcal{F}^{(l)}(\mathbf{x})$ . Then

[TABLE]

where $\mathbf{u}_{\mathsf{i},j}^{(l)}(\mathbf{x})$ is defined in (3.14) and $\mbox{\boldmath\small$ \eta $}_{\mathsf{i},j}(\mathbf{x})$ is defined in (3.6). Since for any $\mathsf{i}\in\mathsf{G}\setminus\mathsf{S}^{(l)}$ , $\mathbf{x}_{\mathsf{i}}^{(l+1)}=\mathbf{x}_{\mathsf{i}}^{(l)}=\bf{0}$ , we have

[TABLE]

Putting (4.3), (4.4) and (4.6) together, we obtain

[TABLE]

With the fact that $\mathcal{E}(\mathbf{x})$ is bounded from below and $\frac{\beta}{2}(1-\varepsilon)>0$ , it follows that $\left\{\mathcal{E}(\mathbf{x}^{(l)})\right\}$ is nonincreasing and converges to a finite value as $l\to\infty$ . Thus

[TABLE]

Because $\mathcal{E}(\mathbf{x})$ is coercive, we know that $\left\{\mathbf{x}^{(l)}\right\}$ is bounded.

∎

The following lemma is the lower bound theory on the nonzero groups of the iteration sequence, which can be used to overcome the non-Lipschitz property.

Lemma 4.3.

There are $0<c<C<\infty,L>0$ such that

[TABLE]

Proof.

From Lemma 4.1, for any $\mathsf{i}\in\mathsf{S}^{(L)}$ and $l\geq L$ , $\mathbf{x}_{\mathsf{i}}^{(l)}\neq\bf 0$ . By Lemma 4.2, the sequence is bounded above,

[TABLE]

We now prove by contradiction that $\|\mathbf{x}_{\mathsf{i}}^{(l)}\|_{p}$ has nonzero lower bound for any $\mathsf{i}\in\mathsf{S}^{(L)},l\geq L$ .

Suppose there exists $\mathsf{i}^{\prime}\in\mathsf{S}^{(L)}$ for some subsequence $\mathbf{x}^{(l_{k})}$ , still denoted by $\mathbf{x}^{(l)}$ , such that

[TABLE]

By the subdifferential expression (3.14), we have for $j\in\operatorname{supp}(\mathbf{x}_{\mathsf{i}^{\prime}}^{(l+1)})$ , and $p\geq 1$ ,

[TABLE]

with the left term,

[TABLE]

Summing up all the terms for $j\in\operatorname{supp}(\mathbf{x}_{\mathsf{i}^{\prime}}^{(l+1)})$ , we have

[TABLE]

where the second inequality holds for the same reason as in the motivating proposition (Proposition 3.3). It follows from the boundedness of $\left\{\mathbf{x}^{(l)}\right\}$ that $\left|\eta_{\mathsf{i}^{\prime},j}^{(l)}(\mathbf{x}^{(l+1)})\right|+\beta\left|\mathbf{x}_{\mathsf{i}^{\prime},j}^{(l+1)}-\mathbf{x}_{\mathsf{i}^{\prime},j}^{(l)}\right|$ is bounded. The condition (3.13) implies that $\left|\mathbf{u}_{\mathsf{i}^{\prime},j}^{(l)}(\mathbf{x}^{(l+1)})\right|$ is also bounded. Thus (4.8) cannot hold as $l\to\infty$ , since $0<q<1$ .

∎

By combining Lemma 4.3 and Proposition 2.1, we can obtain the Lipschitz property over the support of group members.

[TABLE]

when $p\geq 1$ .

Using this property, we can prove the relative error condition (H2) by Lemma 4.4, in which a suitable sequence $\mathbf{v}^{(l+1)}\in\partial\mathcal{E}(\mathbf{x}^{(l+1)})$ is constructed even though $\mathcal{E}(\mathbf{x})$ is non-smooth.

Lemma 4.4.

For each $l\geq L$ , there exists $\mathbf{v}^{(l+1)}\in\partial\mathcal{E}(\mathbf{x}^{(l+1)})$ and constant $\widetilde{C}>0$ such that

[TABLE]

Proof.

For $l\geq L$ , the vector $\mathbf{u}^{(l)}(\mathbf{x}^{(l+1)})$ in the set of $\partial\mathcal{E}^{(l)}(\mathbf{x}^{(l+1)})$ has the form in (3.14),

[TABLE]

Then the intermediate variable $\widehat{\mathbf{v}}^{(l+1)}$ is introduced as follows,

[TABLE]

The upper bound of $\widehat{\mathbf{v}}^{(l+1)}$ can be measured by the iterative error,

[TABLE]

Noting the difference between $\partial\mathcal{E}^{(l)}(\mathbf{x}^{(l+1)})$ and $\partial\mathcal{E}(\mathbf{x}^{(l+1)})$ , we construct $\mathbf{v}^{(l+1)}$ in the form

[TABLE]

where $\mbox{\boldmath\small$ \eta $}^{(l)}_{\mathsf{i},j}(\mathbf{x}^{(l+1)})$ is the same as the part of $\widehat{\mathbf{v}}_{\mathsf{i},j}^{(l+1)}$ and

[TABLE]

Here $\psi_{\mathsf{i},j}$ in $\mbox{\boldmath\small$ \zeta $}_{\mathsf{i},j}(\mathbf{x}^{(l+1)})$ is to be defined by the requirement that $\mathbf{v}^{(l+1)}\in\partial\mathcal{E}(\mathbf{x}^{(l+1)})$ . On one hand, by Lemma 3.1 (i) and (3.3)-(3.4), for $\mathsf{i}\in\mathsf{G}\setminus\mathsf{S}^{(L)}$ , $\partial(\phi\circ g)(\mathbf{x}_{\mathsf{i}})=\Pi_{j\in J_{\mathsf{i}}}(-\infty,+\infty)$ and the set $\partial F_{r}(\mathbf{x}^{(l+1)})$ is bounded, so $\mathbf{v}_{\mathsf{i}}^{(l+1)}$ belongs to the corresponding entries of an element of $\partial\mathcal{E}(\mathbf{x}^{(l+1)})$ . On the other hand, by Lemma 3.1 (ii) and (3.3)-(3.4), for $\mathsf{i}\in\mathsf{S}^{(L)}$ , it can be checked that if $\psi_{\mathsf{i},j}$ satisfies $|\psi_{\mathsf{i},j}|\leq q\|\mathbf{x}_{\mathsf{i}}^{(l+1)}\|_{1}^{q-1}$ , then $\mbox{\boldmath\small$ \zeta $}_{\mathsf{i}}(\mathbf{x})$ is in $\partial(\phi\circ g)(\mathbf{x}_{\mathsf{i}})$ . Thus $\mathbf{v}_{\mathsf{i}}^{(l+1)}$ also belongs to the corresponding entries of an element of $\partial\mathcal{E}(\mathbf{x}^{(l+1)})$ . Therefore, what remains is to construct $\psi_{\mathsf{i},j}$ , which is more technical. $\psi_{\mathsf{i},j}$ is determined by the later estimate of the $\ell^{1}$ error between $\mathbf{v}^{(l+1)}$ and $\widehat{\mathbf{v}}^{(l+1)}$ in the case of $p=1,j\notin\operatorname{supp}(\mathbf{x}_{\mathsf{i}}^{(l+1)})$ . The main idea of constructing $\psi_{\mathsf{i},j}$ is to compare $\psi_{\mathsf{i},j}\in[-q\|\mathbf{x}_{\mathsf{i}}^{(l+1)}\|_{1}^{q-1},q\|\mathbf{x}_{\mathsf{i}}^{(l+1)}\|_{1}^{q-1}]$ and $\mbox{\boldmath\small$ \zeta $}_{\mathsf{i},j}^{(l)}(\mathbf{x}^{(l+1)})\in[-q\|\mathbf{x}_{\mathsf{i}}^{(l)}\|_{1}^{q-1},q\|\mathbf{x}_{\mathsf{i}}^{(l)}\|_{1}^{q-1}]$ in (3.14). That is, let $I=[-q\|\mathbf{x}_{\mathsf{i}}^{(l+1)}\|_{1}^{q-1},q\|\mathbf{x}_{\mathsf{i}}^{(l+1)}\|_{1}^{q-1}]$ ; if $\mbox{\boldmath\small$ \zeta $}_{\mathsf{i},j}^{(l)}(\mathbf{x}^{(l+1)})\in I$ , we choose it. Otherwise, we choose the nearest point in $I$ . Hence we choose

[TABLE]

where $\mbox{\boldmath\small$ \zeta $}_{\mathsf{i},j}^{(l)}(\mathbf{x}^{(l+1)})$ is the part of $\mathbf{u}_{\mathsf{i},j}^{(l)}(\mathbf{x}^{(l+1)})$ . Noting that $0<q<1$ , we can check that $|\psi_{\mathsf{i},j}|\leq q\|\mathbf{x}_{\mathsf{i}}^{(l+1)}\|_{1}^{q-1}$ .

After constructing $\mbox{\boldmath\small$ \zeta $}_{\mathsf{i},j}(\mathbf{x}^{(l+1)})$ , we can now measure the difference between $\mathbf{v}^{(l+1)}$ and $\widehat{\mathbf{v}}^{(l+1)}$ . We divide this measurement into two cases: $p>1$ and $p=1$ . For $p>1$ , the $\ell^{1}$ norm of the difference can be bounded by

[TABLE]

where $C_{p}$ is also the coefficient of norm equivalence. For $p=1$ , it follows,

[TABLE]

where the first inequality comes from (4.12), (4.13) and (3.14).

Combining (4.11), (4.14) and (4.15) yields:

[TABLE]

where $\widetilde{C}=\max\{L_{c}C_{s}C_{p},L_{c}C_{p},\sqrt{N}\beta(2+\varepsilon)/2\}$ .

∎

(H3) is the continuity condition, and it holds naturally. From Appendix 9, we know that $\mathcal{E}(\mathbf{x})$ satisfies the KL property. Finally, we establish our main convergence result.

Theorem 4.5.

The iterative sequence $\left\{\mathbf{x}^{(l)}\right\}$ generated by the InISSAPL algorithm converges globally to the limit point $\mathbf{x}^{\ast}$ , which is a stationary point of problem (1.1).

Proof.

Since $\left\{\mathbf{x}^{(l)}\right\}$ is bounded (Lemma 4.2), there exist a subsequence $(\mathbf{x}^{(k_{l})})$ and $\mathbf{x}^{\ast}$ such that

[TABLE]

By combining (4.2), (4.10) and (4.16), and by Theorem 9.2 in the appendix, the sequence $\left\{\mathbf{x}^{(l)}\right\}$ converges globally to the limit point $\mathbf{x}^{\ast}$ , which is a stationary point of $\mathcal{E}$ . ∎

5 Algorithm Implementation

Each iteration step of the InISSAPL algorithm is, in essence, a weighted $\ell_{p,1}-\ell_{r}(~{}p\geq 1,~{}r\geq 1)$ minimization problem. It is convex, and an inexact inner loop is allowed in the implementation. Standard methods such as ADMM [8], the split Bregman method [20, 37] and primal-dual algorithms [11, 19] can be used to solve it efficiently. Here we adopt the scaled ADMM.

5.1 Scaled ADMM

At the $l$ -th step of InISSAPL, solving ( $\mathcal{P}_{x}$ ) is equivalent to solving

[TABLE]

over the group support set $\mathsf{S}^{(l)}$ . For brevity of notation, we still use the boldface $\mathbf{x},\mathbf{y},\mathbf{z},\cdots$ to denote the vectors on $\mathsf{S}^{(l)}$ in the following.

Equivalently, we can solve the following constrained optimization problem

[TABLE]

where

[TABLE]

We introduce the penalty parameters $\rho_{1},\rho_{2}>0$ (denoted by $\rho=(\rho_{1},\rho_{2})$ ) and the Lagrangian multipliers $\mbox{\boldmath\small$ \lambda $},\mbox{\boldmath\small$ \mu $}$ , then the scaled augmented Lagrangian functional for the weighted problem (5.2) at $l$ -th step is the following:

[TABLE]

The scaled ADMM for solving (5.2) is described as follows. When there is no confusion with the notations, we use $\bar{\mathbf{x}}^{(i)},\mathbf{s}^{(i)},\mathbf{z}^{(i)}$ to denote the iterates at the $i$ -th step of the inner loop of the scaled ADMM.

Initialization: Start with $\bar{\mathbf{x}}^{(0)}=\mathbf{x}^{(l)}_{\mathsf{S}^{(l)}},\mbox{\boldmath\small$ \lambda $}^{(0)}=\mathbf{0},\mbox{\boldmath\small$ \mu $}^{(0)}=\mathbf{0}$ .

Iteration: For $i=0,1,\ldots,\textmd{MAXit}$ ,

Compute

$(\mathbf{z}^{(i+1)},\mathbf{s}^{(i+1)})=\arg\min_{\mathbf{z},\mathbf{s}}\mathcal{L}^{(l)}_{\rho}(\bar{\mathbf{x}}^{(i)},\mathbf{z},\mathbf{s};\mbox{\boldmath\small$ \lambda $}^{(i)},\mbox{\boldmath\small$ \mu $}^{(i)}).$

(5.3)

Compute

$\bar{\mathbf{x}}^{(i+1)}=\arg\min_{\bar{\mathbf{x}}}\mathcal{L}^{(l)}_{\rho}(\bar{\mathbf{x}},\mathbf{z}^{(i+1)},\mathbf{s}^{(i+1)};\mbox{\boldmath\small$ \lambda $}^{(i)},\mbox{\boldmath\small$ \mu $}^{(i)}).$

(5.4)

Update

$\displaystyle\boldsymbol{\lambda}^{(i+1)}$ $\displaystyle=\boldsymbol{\lambda}^{(i)}+\mathbf{A}\bar{\mathbf{x}}^{(i+1)}-\mathbf{y}-\mathbf{s}^{(i+1)},$

(5.5)

$\displaystyle\boldsymbol{\mu}^{(i+1)}$ $\displaystyle=\boldsymbol{\mu}^{(i)}+\bar{\mathbf{x}}^{(i+1)}-\mathbf{z}^{(i+1)}.$

(5.6)
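The inner loop can be sketched as follows. The multiplier updates are exactly (5.5) and (5.6); everything else is an assumption made for illustration: the scaled augmented Lagrangian is taken to penalize the two constraints by $\frac{\rho_{1}}{2}\|\mathbf{A}\bar{\mathbf{x}}-\mathbf{y}-\mathbf{s}+\boldsymbol{\lambda}\|_{2}^{2}$ and $\frac{\rho_{2}}{2}\|\bar{\mathbf{x}}-\mathbf{z}+\boldsymbol{\mu}\|_{2}^{2}$ and to contain the proximal term $\frac{\beta}{2}\|\bar{\mathbf{x}}-\mathbf{x}^{(l)}\|_{2}^{2}$ , and the hypothetical callables `prox_z(v, t)` and `prox_s(v, t)` return $\arg\min_{\mathbf{w}}g(\mathbf{w})+\frac{1}{2t}\|\mathbf{w}-\mathbf{v}\|_{2}^{2}$ for the regularization and fidelity terms, respectively.

```python
import numpy as np

def scaled_admm_sketch(A, y, x_prev, prox_z, prox_s,
                       beta=0.0, rho1=1.0, rho2=1.0, max_it=1000):
    """Scaled ADMM under the ASSUMED splitting s = A x - y, z = x.
    prox_z and prox_s are hypothetical user-supplied proximal maps."""
    M, N = A.shape
    x = np.asarray(x_prev, dtype=float).copy()
    lam = np.zeros(M)
    mu = np.zeros(N)
    # The x-update matrix does not change during the inner loop.
    H = rho1 * (A.T @ A) + (rho2 + beta) * np.eye(N)
    for _ in range(max_it):
        # (5.3): z and s decouple and are updated independently
        z = prox_z(x + mu, 1.0 / rho2)
        s = prox_s(A @ x - y + lam, 1.0 / rho1)
        # (5.4): quadratic problem in x, solved through its normal equations
        rhs = rho1 * (A.T @ (y + s - lam)) + rho2 * (z - mu) + beta * x_prev
        x = np.linalg.solve(H, rhs)
        # (5.5) and (5.6): scaled multiplier updates
        lam = lam + A @ x - y - s
        mu = mu + x - z
    return x
```

For instance, with no regularization (`prox_z` the identity), $\beta=0$ and the quadratic fidelity $\frac{1}{2}\|\mathbf{s}\|_{2}^{2}$ (`prox_s = lambda v, t: v / (1 + t)`), the iteration approaches the least-squares solution of $\mathbf{A}\mathbf{x}=\mathbf{y}$ . A practical implementation would stop on the primal and dual residuals rather than running a fixed number of iterations.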

5.2 Solving (5.3) and (5.4)

The subproblems (5.3) and (5.4) can be efficiently solved.

(i)

The minimization subproblem in (5.3) is equivalent to solving

[TABLE]

which can be separated into two independent subproblems.

(a)

$\mathbf{z}$ -minimization problem:

[TABLE]

For $p=1$ , we have the explicit solution by [37],

[TABLE]

For $p=2$ , this group problem is separable, and its minimizer can also be given explicitly by the shrinkage lemma in [36, 32, 33]:

[TABLE]

For general $p>1$ , the problem is strongly convex, and we can use standard nonlinear numerical methods, such as Newton's method, to solve it.

(b)

$\mathbf{s}$ -minimization problem:

[TABLE]

For $r=1$ , it is the same problem as the $\mathbf{z}$ -minimization problem for $p=1$ , so we omit it here.

For $r=2$ , the solution can be obtained easily,

[TABLE]

For general $r>1$ , we can also use standard nonlinear numerical methods to solve it efficiently.

For $r=\infty$ , the $\mathbf{s}$ -minimization problem reads,

[TABLE]

Let $\widetilde{\mathbf{s}},\widetilde{\mathbf{v}}$ be obtained from $\mathbf{s},\mathbf{v}$ by sorting according to the absolute values of the elements of the known vector $\mathbf{v}$ in ascending order. Then the problem is equivalent to solving

[TABLE]

Its optimal solution can be obtained by Theorem 5.1 in the next subsection,

[TABLE]

where $i^{\ast}\in\{0,1,2,\cdots,n-1\}$ and $t_{i^{\ast}}$ satisfies (5.10).

(ii)

The minimization problem in (5.4) is equivalent to solving

[TABLE]

The optimality condition is the linear system

[TABLE]

Since its coefficient matrix is symmetric positive definite, the system can be solved by inverting this matrix.
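The closed-form updates mentioned in (i) are standard shrinkage operators and can be sketched as below. The thresholds `tau` stand for the linearization weights divided by $\rho_{2}$ ; this scaling, the function names and the `groups` list of index arrays are assumptions of the sketch, and the $r=2$ update is derived for the assumed subproblem $\min_{\mathbf{s}}\frac{1}{2\alpha}\|\mathbf{s}\|_{2}^{2}+\frac{\rho_{1}}{2}\|\mathbf{s}-\mathbf{v}\|_{2}^{2}$ .

```python
import numpy as np

def soft_threshold(v, tau):
    """p = 1: componentwise soft-thresholding, argmin tau*|z| + 0.5*(z - v)^2."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def group_shrink(v, groups, tau):
    """p = 2: group-wise shrinkage, argmin tau_i*||z_i||_2 + 0.5*||z_i - v_i||_2^2."""
    z = np.zeros_like(v, dtype=float)
    for idx, t in zip(groups, tau):
        nrm = np.linalg.norm(v[idx])
        if nrm > t:                    # otherwise the whole group is set to zero
            z[idx] = (1.0 - t / nrm) * v[idx]
    return z

def s_update_r2(v, alpha, rho1):
    """r = 2: minimizer of ||s||^2/(2*alpha) + (rho1/2)*||s - v||^2."""
    return (alpha * rho1 / (1.0 + alpha * rho1)) * v
```

The group-wise operator is what produces exactly zero groups inside the inner loop, which in turn shrinks the group support used by the next outer iteration.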

*Remark**.*

In fact, when $r=2$ , it is unnecessary to introduce the variable $\mathbf{s}$ . The scaled ADMM can be simplified in this case.

5.3 The analytical solution for the $\mathbf{s}$ -problem with infinity norm

Now we consider the equivalent $\mathbf{s}$ -minimization problem for $r=\infty$ in (5.7). It is strongly convex, so it has a unique solution.

Theorem 5.1.

Suppose $\widetilde{\mathbf{s}},\widetilde{\mathbf{v}}\in\mathbb{R}^{n}$ , and the elements of $\widetilde{\mathbf{v}}$ are in ascending order of absolute value, $|\widetilde{\mathbf{v}}_{1}|\leq|\widetilde{\mathbf{v}}_{2}|\leq\cdots\leq|\widetilde{\mathbf{v}}_{n}|$ . Then the minimization problem

[TABLE]

has the explicit optimal solution,

[TABLE]

where $i^{\ast}$ is a specific element of $\{0,1,\cdots,n-1\}$ such that

[TABLE]

holds simultaneously.

Proof.

Suppose $s_{\infty}=\|\widetilde{\mathbf{s}}\|_{\infty}$ . The minimization problem (5.8) can be rewritten in the simpler form

[TABLE]

We remark that the optimal $s_{\infty}$ cannot exceed $|\widetilde{\mathbf{v}}_{n}|$ : if $s_{\infty}>|\widetilde{\mathbf{v}}_{n}|$ , then the choice $\widetilde{\mathbf{s}}=\widetilde{\mathbf{v}}$ with $s_{\infty}=|\widetilde{\mathbf{v}}_{n}|$ gives a smaller objective value, which is a contradiction. Hence we can replace $s_{\infty}$ by $0\leq t\leq|\widetilde{\mathbf{v}}_{n}|$ , and the minimization problem (5.11) can be modified to be

[TABLE]

In fact, the objective functional $f(t)$ is defined piecewise. Letting $\widetilde{\mathbf{v}}_{0}=0$ , we have

[TABLE]

and

[TABLE]

For $i=1,2,\cdots,n-1$ , the right limit of the derivative of $f(t)$ at $t=|\widetilde{\mathbf{v}}_{i}|$ is,

[TABLE]

similarly, the left limit of the derivative of $f(t)$ at $t=|\widetilde{\mathbf{v}}_{i}|$ is

[TABLE]

Since $f(t)$ is continuous at $|\widetilde{\mathbf{v}}_{i}|$ and $f^{\prime}(|\widetilde{\mathbf{v}}_{i}|+0)=f^{\prime}(|\widetilde{\mathbf{v}}_{i}|-0)$ , $f(t)$ is continuously differentiable.

Furthermore, from (5.12), we know that the derivative of $f(t)$ is monotonically increasing. Hence $f(t)$ is convex. Thus $f^{\prime}(t)=0$ can give us the optimal solution of the simplified problem (5.11). Let

[TABLE]

If there exists $i^{\ast}$ such that $t_{i^{\ast}}\in[|\widetilde{\mathbf{v}}_{i^{\ast}}|,|\widetilde{\mathbf{v}}_{i^{\ast}+1}|]$ , then $t_{i^{\ast}}$ is the minimizer. Evidently, the optimal solution of minimization (5.8) can be given by (5.9).

∎
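The construction in the proof translates into a short routine: sort the magnitudes, locate the interval that contains the root of $f^{\prime}$ , and clip. The sketch below is written for the generic problem $\min_{\mathbf{s}}\kappa\|\mathbf{s}\|_{\infty}+\frac{1}{2}\|\mathbf{s}-\mathbf{v}\|_{2}^{2}$ ; identifying $\kappa$ with $1/(\alpha\rho_{1})$ is an assumption about the constants in (5.7), and the explicit zero-solution branch for $\|\mathbf{v}\|_{1}\leq\kappa$ is added here for robustness.

```python
import numpy as np

def prox_linf(v, kappa):
    """argmin_s kappa*||s||_inf + 0.5*||s - v||_2^2 via sorting.
    The solution clips v at a level t: s_j = sign(v_j) * min(|v_j|, t)."""
    v = np.asarray(v, dtype=float)
    a = np.sort(np.abs(v))                  # |v~_1| <= ... <= |v~_n|
    n = a.size
    if a.sum() <= kappa:                    # f'(0) >= 0, so the optimal level is t = 0
        return np.zeros_like(v)
    upper = np.concatenate(([0.0], a))      # upper[i + 1] = |v~_{i+1}|, upper[0] = 0
    tail = np.cumsum(a[::-1])[::-1]         # tail[i] = sum_{j > i} |v~_j|
    t = 0.0
    for i in range(n):
        # stationary point of f on the i-th piece [|v~_i|, |v~_{i+1}|]
        t = (tail[i] - kappa) / (n - i)
        if t <= upper[i + 1]:               # first piece whose root is admissible
            break
    return np.sign(v) * np.minimum(np.abs(v), max(t, 0.0))
```

For example, `prox_linf(np.array([3.0, 1.0]), 1.0)` clips the first entry to the level $t=2$ and leaves the second unchanged. The sort dominates the cost, so the routine runs in $O(n\log n)$ time.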

6 Numerical Experiments

Numerical experiments are reported in this section to show the efficiency of the InISSAPL algorithm. All of them are implemented on a laptop (Intel(R) Core(TM) Duo i5-7200u @2.50GHz 2.70GHz, 4.00GB RAM) using Matlab (License ID: 1108635).

We consider numerical tests on group sparse signal recovery. Let $\mathbf{x}_{or}$ denote the group sparse original signal, which is generated by randomly splitting its components into $\mathsf{g}$ groups. For each nonzero group member, its entries are randomly generated as i.i.d. Gaussian. Suppose that $\mathbf{B}\in\mathbb{R}^{M\times N}$ is randomly generated by an i.i.d. Gaussian ensemble. We let $\mathbf{A}$ be the row orthogonalized matrix of $\mathbf{B}$ , obtained by $\mathbf{A}=(orth(\mathbf{B}^{\prime}))^{\prime}$ in Matlab. Then the measurement $\mathbf{y}$ is obtained by

[TABLE]

where $\sigma$ is the noise level and $noise$ represents the three popular ones, Laplace noise, Gaussian noise and uniform noise.

We denote by $s$ the number of nonzero groups of the original signal $\mathbf{x}_{or}$ . Then the sparsity level $k_{s}$ is defined by $k_{s}=s/\mathsf{g}$ . For simplicity, we consider uniform group partitions, in which all groups have the same size, denoted by $n$ . Define the relative recovery error $\epsilon$ by

[TABLE]
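The test setup can be mimicked in NumPy as follows. This is a sketch rather than the authors' Matlab script: the additive model $\mathbf{y}=\mathbf{A}\mathbf{x}_{or}+\sigma\cdot noise$ , the unit scaling of the three noise distributions, and the Euclidean relative error $\|\mathbf{x}-\mathbf{x}_{or}\|_{2}/\|\mathbf{x}_{or}\|_{2}$ are assumptions, and the reduced QR factorization is used in place of Matlab's `orth`.

```python
import numpy as np

def make_problem(M=256, N=1024, n=8, s=8, sigma=1e-3, noise="gaussian", seed=0):
    """Random group sparse recovery instance (N must be divisible by n)."""
    rng = np.random.default_rng(seed)
    g = N // n
    groups = rng.permutation(N).reshape(g, n)     # random uniform partition
    x_or = np.zeros(N)
    for i in rng.choice(g, size=s, replace=False):
        x_or[groups[i]] = rng.standard_normal(n)  # s nonzero Gaussian groups
    B = rng.standard_normal((M, N))
    Q, _ = np.linalg.qr(B.T)                      # N x M, orthonormal columns
    A = Q.T                                       # rows of A are orthonormal
    if noise == "gaussian":
        e = rng.standard_normal(M)
    elif noise == "laplace":
        e = rng.laplace(size=M)
    elif noise == "uniform":
        e = rng.uniform(-1.0, 1.0, size=M)
    else:
        raise ValueError("unknown noise type")
    y = A @ x_or + sigma * e                      # assumed additive noise model
    return A, y, x_or, groups

def relative_error(x, x_or):
    """Assumed definition: ||x - x_or||_2 / ||x_or||_2."""
    return np.linalg.norm(x - x_or) / np.linalg.norm(x_or)
```

Row orthogonality gives $\mathbf{A}\mathbf{A}^{\top}=\mathbf{I}$ , which keeps the measurement operator well conditioned.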

In our numerical experiments, we set $M=256,N=1024$ for the size of problem, $\sigma=0.001$ for the noise level and $n=8$ for the uniform group size, unless otherwise mentioned. A recovery is regarded as successful when the relative error $\epsilon$ is less than $1\%$ . For the stopping criteria in the InISSAPL algorithm, we use the same criterion as in [8] by setting $\epsilon^{\mbox{abs}}=\epsilon^{\mbox{rel}}=10^{-3}$ in the inner scaled ADMM loop, where

[TABLE]

with

[TABLE]

We adopt the stopping criterion $\|\mathbf{x}^{(l+1)}-\mathbf{x}^{(l)}\|_{2}/\|\mathbf{x}^{(l)}\|_{2}\leq 10^{-3}$ for the outer iteration. The maximal iteration numbers are set to MAXit=1000 in the ADMM and MAX=100 in the outer iteration.

6.1 Experiments on the initialization of the InISSAPL

We report the results of experiments when the different starting points are chosen in InISSAPL algorithm. The first kind of starting points are $c\mathbb{1}$ with $c\neq 0$ . We choose $c=1$ in the test. By setting $p=2,q=0.5,r=2$ for Gaussian noise, we compute the relative errors $\epsilon$ . The second kind of starting points are randomly generated as i.i.d. Gaussian. We compute the average relative error $\bar{\epsilon}$ of 1000 different starting points for the same problem setting as in the first kind.

The experiments are performed for different signal recovery problems with three sensing matrices $\mathbf{A}_{1},\mathbf{A}_{2},\mathbf{A}_{3}$ and three sparsity cases $s=8,s=16,s=24$ . The comparisons are displayed in Table 1.

It shows that the InISSAPL algorithm is effective and not sensitive to the choice of the suggested starting points, even for the less sparse case $s=24$ . Based on this fact, we will choose the all-one vector as the starting point in the following experiments.

The InISSAPL algorithm covers many cases for different choices of $p,q,r$ . We discuss them separately in the following subsections.

6.2 Accessible to diversity of noise

Our algorithm is applicable to different types of noise. Here we fix $q=1/2,p=2$ and noise level $\sigma=0.01$ to show the performance for three kinds of noise, Laplace noise, Gaussian noise, and uniform distribution noise.

For a specific case of noise, we compare the relative error in Table 2 when the fidelity term uses different $\ell_{r}~{}(r=1,2,\infty)$ norms. It is clearly illustrated that $r=1$ is best for Laplace noise, $r=2$ is best for Gaussian noise and $r=\infty$ is best for uniform noise.

6.3 Choice of $p$ and $q$

We discuss numerically the InISSAPL algorithm on the parameters $p,q$ in the $\ell_{p,q}$ regularization term. Firstly, letting $p=r=2$ , we test the algorithm when $q$ varies among $\{0.1,0.3,0.5,0.7,0.9\}$ . The rate of success on sparsity level is demonstrated in Figure 1. It shows that the algorithm performs best when $q=1/2$ . This fact is consistent with the numerical results in [21, 34].

Secondly, we examine the algorithm for the commonly used $p=1$ and $p=2$ under the three kinds of noise with $q=1/2$ . As suggested in the previous subsection, we use $r=1$ for Laplace noise, $r=2$ for Gaussian noise and $r=\infty$ for uniform noise, respectively. We compare the rate of success on sparsity level in Figure 2. It can be observed that the rate of success with $p=1$ is better than that with $p=2$ for Laplace noise, and conversely for Gaussian noise. For uniform noise, there is no essential numerical difference between $p=1$ and $p=2$ . These results show that different $p$ values may suit different models.

6.4 Sensitivity analysis on group size

In this subsection, we study the sensitivity of our algorithm to the group size. We run experiments to show the rate of success over different group sizes ( $n=4,8,16,32$ ) for the three types of noise. As before, we set $r=1,q=1/2$ for Laplace noise, $r=2,q=1/2$ for Gaussian noise and $r=\infty,q=1/2$ for uniform noise. The sensitivity results are given in Figure 3 with $p=1$ and $p=2$ . It shows that the larger the group size, the higher the rate of success. This is because more information is included for a larger group size.

6.5 Comparison with some state-of-the-art algorithms

We compare the InISSAPL algorithm with existing algorithms for the group sparse model, namely PGM-GSO [21] and the convex Group Lasso [8]. The code of the PGM-GSO algorithm (available online at https://CRAN.R-project.org/package=GSparO) takes an additional input: the number of nonzero groups $s$ . In our experiments, PGM-GSO denotes their algorithm run with the exact $s$ of the ground truth. Since in applications it is hard to know $s$ exactly, we also test an estimated value $s_{e}$ close to the true value, namely $s_{e}=s+2$ . PGM-GSO with the estimated $s_{e}$ is named e-PGM-GSO. The rates of success are compared in Figure 4 for the parameters $p=2,q=1/2,r=2,n=8$ and Gaussian noise. We can see that the rates of success of PGM-GSO (with the exact number $s$ of nonzero groups of the ground truth) and our InISSAPL are similar, and both are considerably higher than those of e-PGM-GSO and Group Lasso. Note that InISSAPL does not require the number of nonzero groups as an input.

For the competitive algorithms InISSAPL, PGM-GSO, and e-PGM-GSO, we compare the running time and relative error for problems of different sizes in Table 3. It shows that InISSAPL is more efficient than the two PGM-GSO variants, especially for larger scale problems. The reason is that the computation is carried out only on the shrinking group support set.

7 Conclusions

The group sparse $\ell_{p,q}$ - $\ell_{r}$ model is very useful in many applications. The InISSAPL algorithm provides a unified framework to deal with all the cases of parameters $p\geq 1,0<q<1,1\leq r\leq\infty$ . To prove the global convergence of the algorithm with the KL property, we develop a lower bound theory for the nonzero groups of the iterative sequence to overcome the non-Lipschitz feature and construct a sophisticated subdifferential formula. Along the iterations, the unknowns become fewer and fewer and can be calculated by the scaled ADMM in the inner loop. Therefore the algorithm is especially efficient for large-scale problems. Numerical experiments and comparisons demonstrate the good performance of our algorithm.

In our future work, the model and algorithm can be extended to other applications with overlapping group structure, such as gene expression data and patch patterns in image processing.

8 Acknowledgements

We greatly appreciate helpful discussions with Xue Feng, and thank the authors of [21] for making their code available online at https://CRAN.R-project.org/package=GSparO.

9 Appendix

We first recall the basic definitions of the subdifferential and the horizon cone from the reference [29].

Definition 9.1 (Subdifferentials).

Let $h:\mathbb{R}^{N}\to\mathbb{R}\cup\{+\infty\}$ be a proper, lower semicontinuous function.

(i)

The regular subdifferential of $h$ at $\bar{\mathbf{x}}\in\operatorname{dom}h=\{\mathbf{x}\in\mathbb{R}^{N}:h(\mathbf{x})<+\infty\}$ is defined as

[TABLE]

(ii)

The (limiting) subdifferential of $h$ at $\bar{\mathbf{x}}\in\operatorname{dom}h$ is defined as

[TABLE]

(iii)

The horizon subdifferential of $h$ at $\bar{\mathbf{x}}\in\operatorname{dom}h$ is defined as

[TABLE]

*Remark**.*

From Definition 9.1, the following properties hold:

(i)

For any $\bar{\mathbf{x}}\in\operatorname{dom}h$ , $\widehat{\partial}h(\bar{\mathbf{x}})\subseteq\partial h(\bar{\mathbf{x}})$ . If $h$ is continuously differentiable at $\bar{\mathbf{x}}$ , then $\widehat{\partial}h(\bar{\mathbf{x}})=\partial h(\bar{\mathbf{x}})=\left\{\nabla h(\bar{\mathbf{x}})\right\}$ ;

(ii)

For any $\bar{\mathbf{x}}\in\operatorname{dom}h$ , the subdifferential set $\partial h(\bar{\mathbf{x}})$ is closed, i.e.,

[TABLE]

Definition 9.2 (Horizon cone).

For a set $C\subset\mathbb{R}^{N}$ , the horizon cone is the closed cone $C^{\infty}$ given by

[TABLE]

*Remark**.*

A set $C\subset\mathbb{R}^{N}$ is bounded if and only if its horizon cone is just the zero cone: $C^{\infty}=\{\bf 0\}$ .

Secondly, the Kurdyka-Łojasiewicz (KL) property [26, 22] is a useful tool for establishing the convergence of bounded sequences. It covers a wide range of problems [2].

Definition 9.3 (Kurdyka-Łojasiewicz Property).

[1] A proper function $h$ is said to have the Kurdyka-Łojasiewicz property at $\bar{\mathbf{x}}\in\operatorname{dom}\partial h=\{\mathbf{x}\in\mathbb{R}^{\mathsf{N}}:\partial h(\mathbf{x})\neq\emptyset\}$ if there exist $\zeta\in(0,+\infty]$ , a neighborhood $U$ of $\bar{\mathbf{x}}$ , and a continuous concave function $\varphi:[0,\zeta)\to\mathbb{R}_{+}$ such that

(i)

$\varphi(0)=0$ ;

(ii)

$\varphi$ is $C^{1}$ on $(0,\zeta)$ ;

(iii)

for all $s\in(0,\zeta)$ , $\varphi^{\prime}(s)>0$ ;

(iv)

for all $\mathbf{x}\in U$ satisfying $h(\bar{\mathbf{x}})<h(\mathbf{x})<h(\bar{\mathbf{x}})+\zeta$ , the Kurdyka-Łojasiewicz inequality holds:

$\varphi^{\prime}(h(\mathbf{x})-h(\bar{\mathbf{x}}))\operatorname{dist}(0,\partial h(\mathbf{x}))\geq 1,$

where $\operatorname{dist}(0,\partial h(\mathbf{x}))=\min\{\|\mathbf{v}\|:\mathbf{v}\in\partial h(\mathbf{x})\}$ .

A proper, lower semicontinuous function $h$ satisfying the KL property at all points in $\operatorname{dom}\partial h$ is called a KL function. One can refer to [2, 7] for examples of KL functions and the application of KL property in optimization theory.
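As a simple illustration (not taken from the paper), the KL inequality in its standard form $\varphi^{\prime}(h(\mathbf{x})-h(\bar{\mathbf{x}}))\operatorname{dist}(0,\partial h(\mathbf{x}))\geq 1$ can be verified by hand for the scalar function $h(x)=x^{2}$ at $\bar{x}=0$ . Take $\zeta=+\infty$ , $U=\mathbb{R}$ and the concave function $\varphi(s)=s^{1/2}$ , which satisfies $\varphi(0)=0$ and $\varphi^{\prime}(s)=\frac{1}{2}s^{-1/2}>0$ on $(0,+\infty)$ . Since $\partial h(x)=\{2x\}$ , we get

$\varphi^{\prime}(h(x)-h(\bar{x}))\operatorname{dist}(0,\partial h(x))=\frac{1}{2|x|}\cdot 2|x|=1\geq 1\quad\mbox{for all }x\neq 0.$

Hence $h$ has the KL property at $\bar{x}=0$ with this choice of $\varphi$ .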

Recently, the KL property has been extended to the definable functions in an o-minimal structure for the nonsmooth version, see [22, 18, 1, 6] and the references therein. The following definitions and theorem are based on them.

Definition 9.4.

[1] Let $\mathcal{O}=\{{\mathcal{O}}_{n}\}_{n\in\mathbb{N}}$ be such that each ${\mathcal{O}}_{n}$ is a collection of subsets of $\mathbb{R}^{n}$ . The family $\mathcal{O}$ is an o-minimal structure over $\mathbb{R}$ , if it satisfies the following axioms:

(i)

Each ${\mathcal{O}}_{n}$ is a boolean algebra. Namely $\emptyset\in{\mathcal{O}}_{n}$ and for each $A,B\in{\mathcal{O}}_{n}$ , $A\cup B,A\cap B$ , and $\mathbb{R}^{n}\backslash A$ belong to ${\mathcal{O}}_{n}$ .

(ii)

For all $A\in{\mathcal{O}}_{n}$ , $A\times\mathbb{R}$ and $\mathbb{R}\times A$ belong to ${\mathcal{O}}_{n+1}$ .

(iii)

For all $A\in{\mathcal{O}}_{n+1}$ , $\prod(A):=\{(x_{1},\cdots,x_{n})\in\mathbb{R}^{n}|(x_{1},\cdots,x_{n},x_{n+1})\in A\}$ belongs to $\mathcal{O}_{n}$ .

(iv)

For all $i\neq j$ in $\{1,2,\cdots,n\}$ , $\{(x_{1},\cdots,x_{n})\in\mathbb{R}^{n}|x_{i}=x_{j}\}$ belongs to $\mathcal{O}_{n}$ .

(v)

The set $\{(x_{1},x_{2})\in\mathbb{R}^{2}|x_{1}<x_{2}\}$ belongs to $\mathcal{O}_{2}$ .

(vi)

The elements of $\mathcal{O}_{1}$ are exactly finite unions of intervals.

Definition 9.5.

[1] Given an o-minimal structure $\mathcal{O}$ over $\mathbb{R}$ . A set $C$ is said to be definable (in $\mathcal{O}$ ) if $C$ belongs to $\mathcal{O}$ . A function $f:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ is said to be definable in $\mathcal{O}$ if its graph belongs to ${\mathcal{O}}_{n+1}$ .

Definable functions have the following properties:

•

finite sums of definable functions are definable;

•

compositions of definable functions are definable;

•

the function $f(y)=\sup_{x\in C}g(x,y)$ is definable if $g(x,y)$ and the set $C$ are definable.

As an example [18, 1], there exists an o-minimal structure containing the graph of $x^{r}:\mathbb{R}\to\mathbb{R},r\in\mathbb{R}$ , which is given by

[TABLE]

Theorem 9.1.

[1] Any proper lower semicontinuous function $f:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ that is definable in an o-minimal structure $\mathcal{O}$ has the Kurdyka-Łojasiewicz property at each point of $\operatorname{dom}\partial f$ .

By Definition 9.5 and the properties listed above, the objective function $\mathcal{E}$ in this paper is a composition of definable functions and hence definable. By this theorem, it therefore satisfies the KL property.

The following theorem gives a general and important theoretical framework for the convergence of sequences. It has found extensive applications recently [2, 7].

Theorem 9.2.

[2, 7] Let $f:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ be a proper lower semicontinuous function. Consider a sequence $\{\mathbf{x}^{(l)}\}$ that satisfies

(H1). (Sufficient decrease condition). For each $l$ ,

[TABLE]

(H2). (Relative error condition). For each $l$ , there exists $w^{(l+1)}\in\partial f(\mathbf{x}^{(l+1)})$ such that

[TABLE]

(H3). (Continuity condition). There exists a subsequence $\{\mathbf{x}^{(k_{l})}\}$ and $\widetilde{\mathbf{x}}$ such that

[TABLE]

If $f$ has the KL property at the cluster point $\widetilde{\mathbf{x}}$ specified in (H3), then the sequence $\{\mathbf{x}^{(l)}\}$ converges to $\bar{\mathbf{x}}=\tilde{\mathbf{x}}$ as $l\to\infty$ and $\bar{\mathbf{x}}$ is a critical point.

Bibliography


[1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res., 35(2):438–457, 2010.
[2] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program., 137(1-2):91–129, 2013.
[3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[4] D. P. Bertsekas. Control of uncertain systems with a set-membership description of the uncertainty. PhD thesis, May 1971.
[5] W. Bian and X. Chen. Worst-case complexity of smoothing quadratic regularization methods for non-Lipschitzian optimization. SIAM J. Optim., 23(3):1718–1741, 2013.
[6] J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM J. Optim., 18(2):556–572, 2007.
[7] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494, 2014.
[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.
