An Iteratively Re-weighted Method for Problems with Sparsity-Inducing Norms

Feiping Nie; Zhanxuan Hu; Xiaoqian Wang; Rong Wang; Xuelong Li; Heng Huang

arXiv:1907.01121·cs.LG·July 3, 2019

TL;DR

This paper introduces an iteratively re-weighted algorithm with proven convergence for efficiently solving complex sparsity-inducing norm problems in machine learning, demonstrating superior performance in feature selection tasks.

Contribution

The paper presents a novel IRW method with convergence guarantees, applicable to various intractable sparsity problems, and validates its effectiveness through real-data experiments.

Findings

1. IRW method converges reliably and quickly.

2. Outperforms alternative optimization methods in feature selection.

3. Effective for complex sparsity-inducing norm problems.

Abstract

This work aims at solving the problems with intractable sparsity-inducing norms that are often encountered in various machine learning tasks, such as multi-task learning, subspace clustering, feature selection, robust principal component analysis, and so on. Specifically, an Iteratively Re-Weighted method (IRW) with solid convergence guarantee is provided. We investigate its convergence speed via numerous experiments on real data. Furthermore, in order to validate the practicality of IRW, we use it to solve a concrete robust feature selection model with complicated objective function. The experimental results show that the model coupled with proposed optimization method outperforms alternative methods significantly.

Tables

Table I: Machine learning tasks with sparsity-inducing norms

Task	Model
Subspace clustering [2, 3]	$\min_{W \in \mathcal{C}} \|X - XW\|_{0} + \mu \|W\|_{0}$
Subspace clustering [2, 3]	$\min_{W \in \mathcal{C}} \|X - XW\|_{0} + \mu\, rank(W)$
Multi-task Learning [4, 15]	$\min_{W} \sum_{i=1}^{K} \|X_{i}^{T} w_{i} - y_{i}\|_{2}^{2} + \mu \|W\|_{2,0}$
Multi-task Learning [4, 15]	$\min_{W} \sum_{i=1}^{K} \|X_{i}^{T} w_{i} - y_{i}\|_{2}^{2} + \mu\, rank(W)$
RPCA [16]	$\min_{W \in \mathcal{C}} \|X - W\|_{0} + \mu\, rank(W)$
Matrix Completion [17]	$\min_{W \in \mathcal{C}} \|P_{\Omega}(X - W)\|_{0} + \mu\, rank(W)$
Feature Selection [1]	$\min_{W} \|X^{T} W - Y\|_{2,0} + \mu \|W\|_{2,0}$

Equations

$$\min_{x\in\mathcal{C}} f(x)+\mu g(x), \tag{1}$$

$$\|M\|_{r,p}=\left(\sum_{i=1}^{n}\left(\sum_{j=1}^{m}|m_{ij}|^{r}\right)^{\frac{p}{r}}\right)^{\frac{1}{p}}=\left(\sum_{i=1}^{n}\|m^{i}\|_{r}^{p}\right)^{\frac{1}{p}}. \tag{2}$$

$$\|M\|_{S_{p}}=\left(\sum_{i=1}\sigma_{i}^{p}\right)^{\frac{1}{p}}=\left(tr((M^{T}M)^{\frac{p}{2}})\right)^{\frac{1}{p}}, \tag{3}$$

$$\min_{x\in\mathcal{C}} f(x)+\mu\sum_{i}tr((g_{i}^{T}(x)g_{i}(x))^{\frac{p}{2}}). \tag{4}$$

$$tr((g_{i}^{T}(x)g_{i}(x))^{\frac{p}{2}})=\begin{cases}|g_{i}(x)|^{p} & g_{i}(x)\text{ is scalar}\\ \|g_{i}(x)\|_{2}^{p} & g_{i}(x)\text{ is vector}\\ \|g_{i}(x)\|_{S_{p}}^{p} & g_{i}(x)\text{ is matrix}\end{cases}. \tag{5}$$

$$tr((g_{i}^{T}(x)g_{i}(x))^{\frac{1}{2}})=\begin{cases}|g_{i}(x)| & g_{i}(x)\text{ is scalar}\\ \|g_{i}(x)\|_{2} & g_{i}(x)\text{ is vector}\\ \|g_{i}(x)\|_{*} & g_{i}(x)\text{ is matrix}\end{cases} \tag{6}$$

$$\min_{x\in\mathcal{C}} f(x)+\mu\sum_{i}tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}}). \tag{7}$$

$$\lim_{\delta\to 0}tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}})=\begin{cases}|g_{i}(x)|^{p} & g_{i}(x)\text{ is scalar}\\ \|g_{i}(x)\|_{2}^{p} & g_{i}(x)\text{ is vector}\\ \|g_{i}(x)\|_{S_{p}}^{p} & g_{i}(x)\text{ is matrix}\end{cases}. \tag{8}$$

$$\frac{\partial h(g(x))}{\partial x}=\frac{\sum_{i,j}\frac{\partial h(g(x))}{\partial g_{ij}(x)}\partial g_{ij}(x)}{\partial x}=\frac{tr\left(\left(\frac{\partial h(g(x))}{\partial g(x)}\right)^{T}\partial g(x)\right)}{\partial x} \tag{9}$$

$$\frac{\partial tr((g^{T}(x)g(x)+\delta I)^{\frac{p}{2}})}{\partial x}=\frac{tr\left(2\cdot\frac{p}{2}(g^{T}(x)g(x)+\delta I)^{\frac{p-2}{2}}g^{T}(x)\partial g(x)\right)}{\partial x}. \tag{10}$$

$$\frac{\partial h(x)}{\partial x}=2\cdot\frac{p}{2}\,x(x^{T}x+\delta I)^{\frac{p-2}{2}}, \tag{11}$$

$$\frac{\partial h(g(x))}{\partial g(x)}=2\cdot\frac{p}{2}\,g(x)(g^{T}(x)g(x)+\delta I)^{\frac{p-2}{2}}. \tag{12}$$

$$\frac{\partial tr(g^{T}(x)g(x)D)}{\partial x}=\frac{tr(2Dg^{T}(x)\partial g(x))}{\partial x}. \tag{13}$$

$$L(x,\lambda)=f(x)+\mu\sum_{i}tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}})-r(x,\lambda), \tag{14}$$

$$\frac{\partial L(x,\lambda)}{\partial x}=f^{\prime}(x)+\mu\sum_{i}\frac{\partial tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}})}{\partial x}-\frac{\partial r(x,\lambda)}{\partial x}=0. \tag{15}$$

$$f^{\prime}(x)+\mu\sum_{i}\frac{tr\left(2\cdot\frac{p}{2}(g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p-2}{2}}g_{i}^{T}(x)\partial g_{i}(x)\right)}{\partial x}-\frac{\partial r(x,\lambda)}{\partial x}=0. \tag{16}$$

$$f^{\prime}(x)+\mu\sum_{i}\frac{tr(2D_{i}g_{i}^{T}(x)\partial g_{i}(x))}{\partial x}-\frac{\partial r(x,\lambda)}{\partial x}=0. \tag{17}$$

$$\min_{x\in\mathcal{C}} f(x)+\mu\sum_{i}tr(g_{i}^{T}(x)g_{i}(x)D_{i}). \tag{18}$$

$$\frac{p}{2}\sigma-\sigma^{\frac{p}{2}}+\frac{2-p}{2}\geq 0. \tag{19}$$

$$f^{\prime}(\sigma)=p(1-\sigma^{\frac{p-2}{2}}),\quad\text{and}\quad f^{\prime\prime}(\sigma)=\frac{p(2-p)}{2}\sigma^{\frac{p-4}{2}}.$$

$$tr(\tilde{M}M)\geq tr(\Sigma\Lambda). \tag{20}$$

$$tr(\tilde{M}^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})\leq tr(M^{\frac{p}{2}})-\frac{p}{2}tr(MM^{\frac{p-2}{2}}). \tag{21}$$

$$\frac{p}{2}\sigma\lambda^{\frac{p-2}{2}}-\sigma^{\frac{p}{2}}+\frac{2-p}{2}\lambda^{\frac{p}{2}}\geq 0. \tag{22}$$

$$\frac{p}{2}tr(\Sigma\Lambda^{\frac{p-2}{2}})-tr(\Sigma^{\frac{p}{2}})+\frac{2-p}{2}tr(\Lambda^{\frac{p}{2}})\geq 0, \tag{23}$$

$$\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})-\frac{p}{2}tr(\Sigma\Lambda^{\frac{p-2}{2}})\geq 0. \tag{24}$$

$$\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})-tr(\Sigma^{\frac{p}{2}})+\frac{2-p}{2}tr(\Lambda^{\frac{p}{2}})\geq 0. \tag{25}$$

$$\begin{array}{l}\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})-tr(\tilde{M}^{\frac{p}{2}})+\frac{2-p}{2}tr(M^{\frac{p}{2}})\geq 0\\ \Rightarrow tr(\tilde{M}^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})\leq\frac{2-p}{2}tr(M^{\frac{p}{2}})\\ \Rightarrow tr(\tilde{M}^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})\leq tr(M^{\frac{p}{2}})-\frac{p}{2}tr(MM^{\frac{p-2}{2}}),\end{array}$$

$$\begin{array}{l}tr((\tilde{A}^{T}\tilde{A}+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{A}^{T}\tilde{A}(A^{T}A+\delta I)^{\frac{p-2}{2}})\\ \leq tr((A^{T}A+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr(A^{T}A(A^{T}A+\delta I)^{\frac{p-2}{2}})\end{array}. \tag{26}$$

$$\begin{array}{l}tr((\tilde{A}^{T}\tilde{A}+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr((\tilde{A}^{T}\tilde{A}+\delta I)(A^{T}A+\delta I)^{\frac{p-2}{2}})\\ \leq tr((A^{T}A+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr((A^{T}A+\delta I)(A^{T}A+\delta I)^{\frac{p-2}{2}}),\end{array} \tag{27}$$

$$f(\tilde{x})+\mu\sum_{i}tr(g_{i}^{T}(\tilde{x})g_{i}(\tilde{x})D_{i})\leq f(x)+\mu\sum_{i}tr(g_{i}^{T}(x)g_{i}(x)D_{i}), \tag{28}$$


F. Nie, Z. Hu, R. Wang and X. Li are with the School of Computer Science, OPTIMAL, Northwestern Polytechnical University, Xi'an 710072, Shaanxi, P. R. China. X. Wang and H. Huang are with Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA.


Index Terms: sparse learning, low rank learning, feature selection.

I Introduction

Regularized risk minimization arises in many areas of machine learning. It seeks a trade-off between a loss function and a regularizer:

$$\min_{x\in\mathcal{C}} f(x)+\mu g(x), \tag{1}$$

where $f(x)$ denotes the loss function, $g(x)$ denotes the regularizer, and $\mu$ is a regularization parameter balancing the two terms. Generally, the loss function $f(x)$ depends on the problem we aim to solve, while the regularizer $g(x)$ encodes an assumption about the structure of $x$. In practice, sparsity-inducing norms are a representative choice for both $f(x)$ and $g(x)$, and have been widely used in machine learning tasks such as feature selection [1], subspace clustering [2, 3], and multi-task learning [4]. Table I gives a brief summary of representative models; the notations $\|\bullet\|_{0}$ and $\|\bullet\|_{2,0}$ can be found in Section II, and the details of these tasks can be found in the original papers. Note that in this paper we treat $rank(X)$ as a sparsity-inducing norm, since $rank(X)=\sum_{i=1}(\sigma_{i}(X))^{0}$, where $\{\sigma_{i}(X)\}$ are the singular values of $X$.

Solving the models with sparsity-inducing norms listed in Table I is generally NP-hard, and a common approach is to look for relaxations that solve the original objective functions approximately but efficiently. Most of these relaxations are convex, such as the $\ell_{1}$-norm, the $\ell_{2,1}$-norm, and the nuclear norm. In addition to solid theoretical guarantees, a significant advantage of convex relaxations is that the resulting problems can be solved efficiently by traditional optimization methods, including the Alternating Direction Method of Multipliers (ADMM), the Frank-Wolfe (FW) algorithm, proximal algorithms, and stochastic gradient descent (SGD) [5].

In practice, however, convex relaxations often lead to an over-penalized problem [6]. To alleviate this issue, numerous non-convex relaxations have been proposed, such as the $\ell_{p}$ (Schatten $p$)-norm [7], the Capped-$\ell_{1}$-norm [8], the Truncated Nuclear Norm [9], MCP [10], and SCAD [11]. Although non-convex relaxations have achieved great success in several practical applications, solving the resulting problems is still challenging. The Concave-Convex Procedure (CCP) [12] is a principled approach to non-convex problems. Nevertheless, its practicality is generally limited by the high time cost of solving each subproblem. Recently, numerous efforts have been made to generalize the proximal algorithm to non-convex problems, such as general iterative shrinkage and thresholding (GIST) [6], Inertial Forward-Backward (IFB) [13], nonmonotone Accelerated Proximal Gradient (nonAPG) [14], and Redistributing Nonconvexity [5]. However, none of them can handle the case in which both the loss function and the regularizer are non-convex and non-smooth.

The Iteratively Re-Weighted method (IRW) that we focus on in this work has been used in previous studies [18, 19, 20, 7, 21, 22], but all of them address models involving only low rank regularizers. In this work, we generalize IRW to a general problem whose objective function has multiple sparsity-inducing norms, including the Schatten $p$-norm, i.e., the low rank regularizer. (Footnote 1: $\|\bullet\|_{0}$, $\|\bullet\|_{2,0}$, $rank(\bullet)$ and their relaxations are all called sparsity-inducing norms, but in this paper we mainly focus on the latter.) The key principle of IRW lies in finding a surrogate function with the following two properties:

• It is convex and smooth;

• Its closed-form solution can be computed efficiently.

We show that the original complicated problem can be solved efficiently by iteratively minimizing the surrogate function. In addition, we provide a solid theoretical analysis of the proposed method, and conduct numerous experiments on real data to investigate its convergence speed. To further validate the practicality of the proposed method, we use it to solve a novel robust feature selection model developed in this paper. Numerous experimental results demonstrate that the model coupled with the proposed optimization method IRW provides a large advantage over alternative algorithms.

II Notations and Definitions

In this paper, we use a lowercase letter for a quantity that may be a scalar, vector or matrix, and an uppercase letter for a quantity that is a matrix. For a matrix $M$, its $i$-th row, $j$-th column and $ij$-th entry are denoted by ${m}^{i}$, ${m}_{j}$ and $m_{ij}$, respectively. $tr(M)$ denotes the trace of matrix $M$.

The $\ell_{r,p}$-norm of a matrix $M$ with $n$ rows and $m$ columns is defined as

$$\|M\|_{r,p}=\left(\sum_{i=1}^{n}\left(\sum_{j=1}^{m}|m_{ij}|^{r}\right)^{\frac{p}{r}}\right)^{\frac{1}{p}}=\left(\sum_{i=1}^{n}\|m^{i}\|_{r}^{p}\right)^{\frac{1}{p}}. \tag{2}$$

In particular, when $r\geq 1$ and $p\geq 1$, the $\ell_{r,p}$-norm is a valid norm because it satisfies the three norm conditions, including the triangle inequality $\|{A}\|_{r,p}+\|{B}\|_{r,p}\geq\|{A+B}\|_{r,p}$. When $r<1$ or $p<1$, the $\ell_{r,p}$-norm is not a valid norm; the term “norm” is used here for convenience. In Eq.(2), when $M$ becomes a column or row vector $m$, the $\ell_{r,p}$-norm of $M$ is reduced to the $\ell_{p}$-norm of $m$.
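As an illustration of Eq.(2), the following minimal NumPy sketch computes the $\ell_{r,p}$-norm as the $\ell_{p}$-norm of the vector of row-wise $\ell_{r}$-norms. The function name `lrp_norm` and the example matrix are assumptions made for this illustration and do not come from the paper.

```python
import numpy as np

def lrp_norm(M, r, p):
    """l_{r,p} 'norm' of Eq.(2): l_p norm of the row-wise l_r norms.

    Hypothetical helper for illustration; M must be a 2-D array.
    """
    M = np.asarray(M, dtype=float)
    # l_r norm of each row m^i
    row_norms = np.sum(np.abs(M) ** r, axis=1) ** (1.0 / r)
    # l_p norm of the vector of row norms
    return np.sum(row_norms ** p) ** (1.0 / p)

M = np.array([[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]])
l21 = lrp_norm(M, r=2, p=1)  # sum of row l2 norms: 5 + 0 + 13
l22 = lrp_norm(M, r=2, p=2)  # coincides with the Frobenius norm

# For a column vector, each row is a single entry, so the l_p norm is recovered.
v = np.array([1.0, -2.0, 2.0]).reshape(-1, 1)
l1_of_v = lrp_norm(v, r=2, p=1)
```

Rows that are entirely zero contribute nothing to the sum, which is why small $p$ favors row-sparse matrices.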

The Schatten $p$-norm of a matrix ${M}$ is defined as

$$\|M\|_{S_{p}}=\left(\sum_{i=1}\sigma_{i}^{p}\right)^{\frac{1}{p}}=\left(tr((M^{T}M)^{\frac{p}{2}})\right)^{\frac{1}{p}}, \tag{3}$$

where $\sigma_{i}$ is the $i$-th singular value of ${M}$. When $p\geq 1$, the Schatten $p$-norm is a valid norm. When $p<1$, the Schatten $p$-norm is not a valid norm; the term “norm” is used here for convenience.
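The two expressions for the Schatten $p$-norm, one through the singular values and one through the trace of $(M^{T}M)^{\frac{p}{2}}$, can be compared numerically. The sketch below is illustrative only; the function names and the random test matrix are assumptions, not part of the paper.

```python
import numpy as np

def schatten_p(M, p):
    """Schatten p-'norm' from singular values: (sum_i sigma_i^p)^(1/p)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

def schatten_p_via_trace(M, p):
    """Same quantity as (tr((M^T M)^(p/2)))^(1/p).

    The eigenvalues of the symmetric matrix M^T M are the squared
    singular values of M, so the trace of the matrix power is the sum
    of eigenvalues raised to p/2.
    """
    w = np.linalg.eigvalsh(M.T @ M)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off
    return np.sum(w ** (p / 2.0)) ** (1.0 / p)

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 3))
half_sv = schatten_p(M, 0.5)
half_tr = schatten_p_via_trace(M, 0.5)
nuclear = schatten_p(M, 1.0)  # p = 1 gives the nuclear (trace) norm
```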

III Iteratively Reweighted Method for A General Sparse Coding Problem

III-A A General Sparse Coding Problem

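In this section, we focus on solving the following general problem:

$$\min_{x\in\mathcal{C}} f(x)+\mu\sum_{i}tr((g_{i}^{T}(x)g_{i}(x))^{\frac{p}{2}}). \tag{4}$$

Note that when $g_{i}(x)$ is a scalar, vector or matrix output function, $tr((g_{i}^{T}(x)g_{i}(x))^{\frac{p}{2}})$ becomes the following terms, respectively:

$$tr((g_{i}^{T}(x)g_{i}(x))^{\frac{p}{2}})=\begin{cases}|g_{i}(x)|^{p} & g_{i}(x)\text{ is scalar}\\ \|g_{i}(x)\|_{2}^{p} & g_{i}(x)\text{ is vector}\\ \|g_{i}(x)\|_{S_{p}}^{p} & g_{i}(x)\text{ is matrix}\end{cases}. \tag{5}$$

For the case $p=1$, $tr((g_{i}^{T}(x)g_{i}(x))^{\frac{p}{2}})$ denotes the $\ell_{1}$-norm, the $\ell_{2}$-norm and the trace norm, respectively:

$$tr((g_{i}^{T}(x)g_{i}(x))^{\frac{1}{2}})=\begin{cases}|g_{i}(x)| & g_{i}(x)\text{ is scalar}\\ \|g_{i}(x)\|_{2} & g_{i}(x)\text{ is vector}\\ \|g_{i}(x)\|_{*} & g_{i}(x)\text{ is matrix}\end{cases} \tag{6}$$

Since problem (4) is non-smooth, we instead solve a smooth approximation of it, formulated as follows:

$$\min_{x\in\mathcal{C}} f(x)+\mu\sum_{i}tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}}). \tag{7}$$

When $\delta\to 0$, Eq.(7) reduces to Eq.(4) since the following equations hold:

$$\lim_{\delta\to 0}tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}})=\begin{cases}|g_{i}(x)|^{p} & g_{i}(x)\text{ is scalar}\\ \|g_{i}(x)\|_{2}^{p} & g_{i}(x)\text{ is vector}\\ \|g_{i}(x)\|_{S_{p}}^{p} & g_{i}(x)\text{ is matrix}\end{cases}. \tag{8}$$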

Next, we focus on solving the approximation problem (7).
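The effect of the smoothing parameter $\delta$ in problem (7) can be observed numerically. The sketch below evaluates $tr((G^{T}G+\delta I)^{\frac{p}{2}})$ for a fixed matrix $G$ standing in for $g_{i}(x)$ and compares it with the unsmoothed term as $\delta$ shrinks. The helper name `smoothed_term`, the random matrix, and the chosen values of $\delta$ are assumptions for illustration.

```python
import numpy as np

def smoothed_term(G, p, delta):
    """tr((G^T G + delta*I)^(p/2)) for a 2-D array G.

    Computed from the eigenvalues of the symmetric matrix G^T G, since
    adding delta*I shifts every eigenvalue by delta.
    """
    w = np.linalg.eigvalsh(G.T @ G)
    w = np.clip(w, 0.0, None)  # remove tiny negative round-off
    return float(np.sum((w + delta) ** (p / 2.0)))

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
p = 1.0

# Unsmoothed matrix case: ||G||_{S_p}^p, the sum of singular values to the power p.
exact = float(np.sum(np.linalg.svd(G, compute_uv=False) ** p))
gaps = [abs(smoothed_term(G, p, d) - exact) for d in (1e-2, 1e-4, 1e-8)]

# Vector case: a column vector g gives (g^T g + delta)^(p/2), i.e. ||g||_2^p at delta = 0.
g = rng.standard_normal((4, 1))
vector_case = smoothed_term(g, p, 0.0)
```

In this example the gap to the unsmoothed value shrinks as $\delta$ decreases, consistent with the limit stated above.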

IV Iteratively Reweighted Algorithm for the Approximation Problem

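Before deriving the algorithm for optimizing problem (7), we need several lemmas. First, according to the chain rule in calculus, we have Lemma 1.

Lemma 1 (Chain rule)

Suppose $g(x)$ is a matrix output function, $h(x)$ is a scalar output function, and $x$ is a scalar, vector or matrix variable. Then we have

$$\frac{\partial h(g(x))}{\partial x}=\frac{\sum_{i,j}\frac{\partial h(g(x))}{\partial g_{ij}(x)}\partial g_{ij}(x)}{\partial x}=\frac{tr\left(\left(\frac{\partial h(g(x))}{\partial g(x)}\right)^{T}\partial g(x)\right)}{\partial x} \tag{9}$$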

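According to the chain rule in Lemma 1, we have the following two lemmas:

Lemma 2

Suppose $g(x)$ is a scalar, vector or matrix output function and $x$ is a scalar, vector or matrix variable. Then we have

$$\frac{\partial tr((g^{T}(x)g(x)+\delta I)^{\frac{p}{2}})}{\partial x}=\frac{tr\left(2\cdot\frac{p}{2}(g^{T}(x)g(x)+\delta I)^{\frac{p-2}{2}}g^{T}(x)\partial g(x)\right)}{\partial x}. \tag{10}$$

Proof. Let $h(x)=tr((x^{T}x+\delta I)^{\frac{p}{2}})$. We have

$$\frac{\partial h(x)}{\partial x}=2\cdot\frac{p}{2}\,x(x^{T}x+\delta I)^{\frac{p-2}{2}}, \tag{11}$$

and further, we obtain

$$\frac{\partial h(g(x))}{\partial g(x)}=2\cdot\frac{p}{2}\,g(x)(g^{T}(x)g(x)+\delta I)^{\frac{p-2}{2}}. \tag{12}$$

According to the chain rule in Lemma 1, we get Eq.(10). $\Box$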

Lemma 3

Suppose $g(x)$ is a scalar, vector or matrix output function, $x$ is a scalar, vector or matrix variable, and $D$ is a constant that is symmetric if it is a matrix. Then we have

$$\frac{\partial tr(g^{T}(x)g(x)D)}{\partial x}=\frac{tr(2Dg^{T}(x)\partial g(x))}{\partial x}. \tag{13}$$

Proof. Let $h(x)=tr(x^{T}xD)$. We have $\frac{{\partial h(x)}}{{\partial x}}=2xD$, and hence ${\frac{{\partial h(g(x))}}{{\partial g(x)}}}=2g(x)D$. So according to the chain rule in Lemma 1, we get Eq.(13). $\Box$
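The derivative of $h(x)=tr((x^{T}x+\delta I)^{\frac{p}{2}})$ used in the proof of Lemma 2, namely $p\,x(x^{T}x+\delta I)^{\frac{p-2}{2}}$, can be checked against central finite differences. The following sketch is an illustration under assumed values of $p$ and $\delta$; the helper names (`sym_power`, `h`, `grad_h`, `numerical_grad`) are hypothetical.

```python
import numpy as np

def sym_power(S, a):
    """S^a for a symmetric positive definite matrix S, via eigen-decomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w ** a) @ V.T  # V diag(w^a) V^T

def h(X, p, delta):
    """h(X) = tr((X^T X + delta*I)^(p/2))."""
    w = np.linalg.eigvalsh(X.T @ X + delta * np.eye(X.shape[1]))
    return float(np.sum(w ** (p / 2.0)))

def grad_h(X, p, delta):
    """Closed-form gradient: p * X (X^T X + delta*I)^((p-2)/2)."""
    S = X.T @ X + delta * np.eye(X.shape[1])
    return p * X @ sym_power(S, (p - 2.0) / 2.0)

def numerical_grad(f, X, eps=1e-6):
    """Entry-wise central finite differences of a scalar function f."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2.0 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
p, delta = 0.7, 1e-2
max_err = np.max(np.abs(grad_h(X, p, delta)
                        - numerical_grad(lambda Z: h(Z, p, delta), X)))
```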

IV-A Algorithm Derivation

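Now we derive the algorithm for optimizing problem (7). The Lagrangian function of problem (7) is

$$L(x,\lambda)=f(x)+\mu\sum_{i}tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}})-r(x,\lambda), \tag{14}$$

where $r(x,\lambda)$ is a Lagrangian term for the constraint $x\in\mathcal{C}$. By setting the derivative of Eq.(14) w.r.t. $x$ to zero, we have

$$\frac{\partial L(x,\lambda)}{\partial x}=f^{\prime}(x)+\mu\sum_{i}\frac{\partial tr((g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p}{2}})}{\partial x}-\frac{\partial r(x,\lambda)}{\partial x}=0. \tag{15}$$

According to Lemma 2, Eq.(15) can be rewritten as

$$f^{\prime}(x)+\mu\sum_{i}\frac{tr\left(2\cdot\frac{p}{2}(g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{p-2}{2}}g_{i}^{T}(x)\partial g_{i}(x)\right)}{\partial x}-\frac{\partial r(x,\lambda)}{\partial x}=0. \tag{16}$$

If we can find a solution $x$ that satisfies Eq.(16), then we usually find a stationary point or a global optimal solution of problem (7) according to the Karush-Kuhn-Tucker conditions. However, directly finding a solution $x$ that satisfies Eq.(16) is generally not an easy task. In this paper, we propose an iterative algorithm to find it. A basic observation is that, if $D_{i}=\frac{p}{2}(g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{{p-2}}{2}}$ is a given constant, then Eq.(16) is reduced to

$$f^{\prime}(x)+\mu\sum_{i}\frac{tr(2D_{i}g_{i}^{T}(x)\partial g_{i}(x))}{\partial x}-\frac{\partial r(x,\lambda)}{\partial x}=0, \tag{17}$$

which, by Lemma 3, is the stationarity condition of the following problem:

$$\min_{x\in\mathcal{C}} f(x)+\mu\sum_{i}tr(g_{i}^{T}(x)g_{i}(x)D_{i}). \tag{18}$$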

Based on the observation, we first guess a solution $x$ , then we calculate $D_{i}$ based on the current solution $x$ and then update the current solution $x$ by the optimal solution of the problem (18) based on the calculated $D_{i}$ . We iteratively perform this procedure until it converges. The detailed algorithm is described in Algorithm 1. We will give a theoretical analysis to prove the convergence of the proposed algorithm.
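The alternation between computing $D_{i}$ and solving problem (18) can be sketched on a small concrete instance. The instance below is an assumption chosen for illustration and is not taken from the paper: an unconstrained least-squares loss $f(x)=\|Ax-y\|_{2}^{2}$ with scalar outputs $g_{i}(x)=x_{i}$, so that $D_{i}=\frac{p}{2}(x_{i}^{2}+\delta)^{\frac{p-2}{2}}$ is a scalar and problem (18) becomes a weighted ridge regression with a closed-form solution. All function and variable names are hypothetical.

```python
import numpy as np

def smoothed_objective(A, y, x, mu, p, delta):
    """Objective of problem (7) for f(x) = ||Ax - y||^2 and g_i(x) = x_i."""
    return float(np.sum((A @ x - y) ** 2)
                 + mu * np.sum((x ** 2 + delta) ** (p / 2.0)))

def irw_lp_least_squares(A, y, mu=0.5, p=1.0, delta=1e-6, n_iter=50, tol=1e-12):
    """Iteratively re-weighted sketch for the assumed instance above."""
    AtA, Aty = A.T @ A, A.T @ y
    x = np.ones(A.shape[1])  # initial guess
    history = [smoothed_objective(A, y, x, mu, p, delta)]
    for _ in range(n_iter):
        # Step 1: weights D_i = (p/2) (x_i^2 + delta)^((p-2)/2) at the current x
        d = (p / 2.0) * (x ** 2 + delta) ** ((p - 2.0) / 2.0)
        # Step 2: minimize ||Ax - y||^2 + mu * sum_i D_i x_i^2, i.e. solve
        # (A^T A + mu * diag(D)) x = A^T y
        x = np.linalg.solve(AtA + mu * np.diag(d), Aty)
        history.append(smoothed_objective(A, y, x, mu, p, delta))
        if history[-2] - history[-1] <= tol * max(1.0, abs(history[-1])):
            break
    return x, history

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[[1, 4, 7]] = [2.0, -3.0, 1.5]
y = A @ x_true + 0.01 * rng.standard_normal(30)
x_hat, history = irw_lp_least_squares(A, y)
```

The list `history` records the smoothed objective after each update, which makes it easy to inspect whether the objective decreases from one iteration to the next on a given instance.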

IV-B Convergence Analysis of Algorithm 1

Before proving the convergence of Algorithm 1, we first introduce several lemmas.

Lemma 4

For any $\sigma>0$, the following inequality holds when $0<p\leq 2$:

$$\frac{p}{2}\sigma-\sigma^{\frac{p}{2}}+\frac{2-p}{2}\geq 0. \tag{19}$$

Proof. Denote $f(\sigma)=p\sigma-2\sigma^{\frac{p}{2}}+2-p$. We have the following derivatives:

$$f^{\prime}(\sigma)=p(1-\sigma^{\frac{p-2}{2}}),\quad\text{and}\quad f^{\prime\prime}(\sigma)=\frac{p(2-p)}{2}\sigma^{\frac{p-4}{2}}.$$

Obviously, when $\sigma>0$ and $0<p\leq 2$, we have $f^{\prime\prime}(\sigma)\geq 0$, and for $p<2$, $\sigma=1$ is the only point at which $f^{\prime}(\sigma)=0$ (for $p=2$, $f$ is identically zero). Note that $f(1)=0$; thus when $\sigma>0$ and $0<p\leq 2$, we have $f(\sigma)\geq 0$, which indicates Eq.(19). $\Box$
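The inequality of Lemma 4 can be sanity-checked on a grid of $\sigma$ and $p$ values. This is only a numerical illustration with an assumed grid and a hypothetical function name, not a substitute for the proof.

```python
import numpy as np

def lemma4_lhs(sigma, p):
    """Left-hand side of Eq.(19): (p/2)*sigma - sigma^(p/2) + (2-p)/2."""
    return (p / 2.0) * sigma - sigma ** (p / 2.0) + (2.0 - p) / 2.0

sigmas = np.logspace(-6, 6, 2001)   # positive values of sigma, including sigma = 1
ps = np.linspace(0.05, 2.0, 40)     # values of p in (0, 2]
min_val = min(lemma4_lhs(sigmas, p).min() for p in ps)
at_one = lemma4_lhs(1.0, 0.5)       # the bound is attained at sigma = 1
```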

Lemma 5 ([23])

For any positive definite matrices $\tilde{M},M$ of the same size, suppose the eigen-decompositions are $\tilde{M}=U\Sigma U^{T}$ and $M=V\Lambda V^{T}$, where the eigenvalues in $\Sigma$ are in increasing order and the eigenvalues in $\Lambda$ are in decreasing order. Then the following inequality holds:

$$tr(\tilde{M}M)\geq tr(\Sigma\Lambda). \tag{20}$$

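Lemma 6

For any positive definite matrices $\tilde{M},M$ of the same size, the following inequality holds when $0<p\leq 2$:

$$tr(\tilde{M}^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})\leq tr(M^{\frac{p}{2}})-\frac{p}{2}tr(MM^{\frac{p-2}{2}}). \tag{21}$$

Proof. For any $\sigma>0$, $\lambda>0$ and $0<p\leq 2$, according to Lemma 4 we have $\frac{p}{2}(\frac{\sigma}{\lambda})-(\frac{\sigma}{\lambda})^{\frac{p}{2}}+\frac{{2-p}}{2}\geq 0$, which, after multiplying both sides by $\lambda^{\frac{p}{2}}$, indicates

$$\frac{p}{2}\sigma\lambda^{\frac{p-2}{2}}-\sigma^{\frac{p}{2}}+\frac{2-p}{2}\lambda^{\frac{p}{2}}\geq 0. \tag{22}$$

Suppose the eigen-decompositions are $\tilde{M}=U\Sigma U^{T}$ and $M=V\Lambda V^{T}$, where the eigenvalues in $\Sigma$ are in increasing order and the eigenvalues in $\Lambda$ are in decreasing order. Then, according to Eq.(22), we have

$$\frac{p}{2}tr(\Sigma\Lambda^{\frac{p-2}{2}})-tr(\Sigma^{\frac{p}{2}})+\frac{2-p}{2}tr(\Lambda^{\frac{p}{2}})\geq 0, \tag{23}$$

and according to Lemma 5 we have

$$\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})-\frac{p}{2}tr(\Sigma\Lambda^{\frac{p-2}{2}})\geq 0. \tag{24}$$

Summing Eq.(23) and Eq.(24) gives

$$\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})-tr(\Sigma^{\frac{p}{2}})+\frac{2-p}{2}tr(\Lambda^{\frac{p}{2}})\geq 0. \tag{25}$$

Note that $tr(\tilde{M}^{\frac{p}{2}})=tr(\Sigma^{\frac{p}{2}})$ and $tr(M^{\frac{p}{2}})=tr(\Lambda^{\frac{p}{2}})$, so we have

$$\begin{array}{l}\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})-tr(\tilde{M}^{\frac{p}{2}})+\frac{2-p}{2}tr(M^{\frac{p}{2}})\geq 0\\ \Rightarrow tr(\tilde{M}^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})\leq\frac{2-p}{2}tr(M^{\frac{p}{2}})\\ \Rightarrow tr(\tilde{M}^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{M}M^{\frac{p-2}{2}})\leq tr(M^{\frac{p}{2}})-\frac{p}{2}tr(MM^{\frac{p-2}{2}}),\end{array}$$

which completes the proof. $\Box$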

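Lemma 7

For any matrices $\tilde{A},A$ of the same size and $\delta>0$, the following inequality holds when $0<p\leq 2$:

$$\begin{array}{l}tr((\tilde{A}^{T}\tilde{A}+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr(\tilde{A}^{T}\tilde{A}(A^{T}A+\delta I)^{\frac{p-2}{2}})\\ \leq tr((A^{T}A+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr(A^{T}A(A^{T}A+\delta I)^{\frac{p-2}{2}})\end{array}. \tag{26}$$

Proof. Note that $\tilde{A}^{T}\tilde{A}+\delta I$ and $A^{T}A+\delta I$ are positive definite matrices since $\delta>0$. Then according to Lemma 6 we have

$$\begin{array}{l}tr((\tilde{A}^{T}\tilde{A}+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr((\tilde{A}^{T}\tilde{A}+\delta I)(A^{T}A+\delta I)^{\frac{p-2}{2}})\\ \leq tr((A^{T}A+\delta I)^{\frac{p}{2}})-\frac{p}{2}tr((A^{T}A+\delta I)(A^{T}A+\delta I)^{\frac{p-2}{2}}),\end{array} \tag{27}$$

which indicates Eq.(26), since the term $\frac{p}{2}\delta\, tr((A^{T}A+\delta I)^{\frac{p-2}{2}})$ appears on both sides and cancels. $\Box$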
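The inequality of Lemma 7 can also be checked on random matrices. The sketch below evaluates the right-hand side minus the left-hand side of Eq.(26) for several values of $p$. The helper names, matrix sizes, and the value of $\delta$ are assumptions for illustration.

```python
import numpy as np

def sym_power(S, a):
    """S^a for a symmetric positive definite matrix S, via eigen-decomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w ** a) @ V.T

def lemma7_gap(A_new, A, p, delta):
    """Right-hand side minus left-hand side of Eq.(26); Lemma 7 states it is >= 0."""
    n = A.shape[1]
    S_new = A_new.T @ A_new + delta * np.eye(n)
    S = A.T @ A + delta * np.eye(n)
    W = sym_power(S, (p - 2.0) / 2.0)  # (A^T A + delta*I)^((p-2)/2)
    lhs = np.trace(sym_power(S_new, p / 2.0)) - 0.5 * p * np.trace(A_new.T @ A_new @ W)
    rhs = np.trace(sym_power(S, p / 2.0)) - 0.5 * p * np.trace(A.T @ A @ W)
    return rhs - lhs

rng = np.random.default_rng(0)
gaps = []
for _ in range(100):
    A = rng.standard_normal((4, 3))
    A_new = rng.standard_normal((4, 3))
    for p in (0.3, 1.0, 1.7, 2.0):
        gaps.append(lemma7_gap(A_new, A, p, delta=1e-3))
min_gap = min(gaps)
```

When the two matrices coincide, the two sides of Eq.(26) are identical, so the gap is zero up to round-off.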

As a result, we have the following theorem.

Theorem 1

Algorithm 1 monotonically decreases the objective of problem (7) in each iteration until the algorithm converges.

Proof: In step 2 of Algorithm 1, suppose the updated $x$ is $\tilde{x}$. According to step 2, we know

$$f(\tilde{x})+\mu\sum_{i}tr(g_{i}^{T}(\tilde{x})g_{i}(\tilde{x})D_{i})\leq f(x)+\mu\sum_{i}tr(g_{i}^{T}(x)g_{i}(x)D_{i}), \tag{28}$$

where the equality holds if and only if the algorithm converges.

For each $i$ , according to Lemma 7, we have

[TABLE]

Note that $D_{i}=\frac{p}{2}(g_{i}^{T}(x)g_{i}(x)+\delta I)^{\frac{{p-2}}{2}}$ , so for each $i$ we have

[TABLE]

Then we have

[TABLE]

Summing Eq. (28) and Eq. (31) on both sides, we arrive at

[TABLE]

Note that the equality in Eq.(32) holds only when the algorithm converges. Thus Algorithm 1 monotonically decreases the objective of problem (7) in each iteration until the algorithm converges. $\Box$

At convergence, the equality in Eq. (16) holds, and thus the KKT condition [24] of problem (7) is satisfied. Therefore, Algorithm 1 will usually converge to a stationary point of problem (7). If problem (7) is convex, Algorithm 1 will usually converge to a global optimum.

IV-C An Example Problem

In this subsection, we give a concrete example and show how to derive the optimization algorithm for the example problem based on Algorithm 1. The problem is:

[TABLE]

This problem is a special case of problem (7). According to Algorithm 1, we only need to solve the following problem for each $i$ in each iteration:

[TABLE]

where $D_{1}^{i}$ is a diagonal matrix, the $k$ -th diagonal element is $a^{k}x_{i}-y_{ki}$ , $D_{2}$ is a diagonal matrix, the $k$ -th diagonal element is $\frac{p}{2}((b^{k}X-z^{k})^{T}(b^{k}X-z^{k})+\delta)^{\frac{{p-2}}{2}}$ , $D_{3}=\frac{p}{2}(XX^{T}+\delta I)^{\frac{{p-2}}{2}}$ .

Taking the derivative of problem (34) w.r.t. $x_{i}$ and setting it to zero, we have

[TABLE]

The detailed algorithm for solving the problem (33) is listed in Algorithm 2.

V Iteratively Reweighted Method for A More General Problem

In this section, we focus on generalizing the problem (4) to a more general problem as follows:

[TABLE]

where $h_{i}(x)$ is an arbitrary concave and differentiable function. Inspired by the Algorithm 1, the algorithm to solve the problem (36) is shown in Algorithm 3, where we denote $\frac{{\partial h_{i}(g_{i}^{T}(x)g_{i}(x)+\delta I)}}{{\partial(g_{i}^{T}(x)g_{i}(x)+\delta I)}}=h^{\prime}_{i}(g_{i}^{T}(x)g_{i}(x)+\delta I)$ . We will analyze the convergence of the algorithm in the next subsection.

For example, it can be easily checked that $h(M)=(tr(M))^{\frac{p}{2}}$ is concave and differentiable when $0<p\leq 2$. So Algorithm 3 can be applied to solve the following problem:

[TABLE]

V-A Convergence Analysis of Algorithm 3

Lemma 8

For an arbitrary concave and differentiable function $h(x)$ , the following inequality holds:

[TABLE]

Then we have the following theorem.

Theorem 2

Algorithm 3 monotonically decreases the objective of problem (36) in each iteration until the algorithm converges.

Proof: In step 2 of Algorithm 3, suppose the updated $x$ is $\tilde{x}$ . According to step 2, we know

[TABLE]

where the equality holds when and only when the algorithm converges.

Since $h_{i}(x)$ is concave for each $i$ , according to Lemma 8, we have

[TABLE]

Note that $D_{i}=h^{\prime}_{i}(g_{i}^{T}(x)g_{i}(x)+\delta I)$ , so for each $i$ we have

[TABLE]

and then

[TABLE]

So we have

[TABLE]

Summing Eq. (38) and Eq. (42) on both sides, we arrive at

[TABLE]

Note that equality in the above inequality holds only when the algorithm converges. Thus Algorithm 3 monotonically decreases the objective of problem (36) in each iteration until the algorithm converges. $\Box$

Lemma 9

Suppose $g(x)$ is a scalar, vector or matrix output function, $x$ is a scalar, vector or matrix variable, then we have

[TABLE]

Proof: According to the chain rule in Lemma 1, we have

[TABLE]

which completes the proof. $\Box$

Theorem 3

Algorithm 3 converges to a point satisfying the KKT condition of problem (36).

Proof: The Lagrangian function of the problem (36) is

[TABLE]

Based on the KKT condition, by setting the derivative of $L_{1}(x,\lambda)$ w.r.t. $x$ to zero, we have

[TABLE]

According to Lemma 9, Eq.(46) can be rewritten as

[TABLE]

On the other hand, in the second step of the Algorithm 3, we solve the problem $\mathop{\min}\limits_{x\in\mathcal{C}}f(x)+\sum\limits_{i}{tr(g_{i}^{T}(x)g_{i}(x)D_{i})}$ . The Lagrangian function of this problem is

[TABLE]

By setting the derivative of $\mathcal{L}_{2}(x,\lambda)$ w.r.t. $x$ to zero, we have

[TABLE]

According to Lemma 3, Eq.(49) can be rewritten as

[TABLE]

Thus we find a solution satisfying Eq.(50) in each iteration according to the second step of Algorithm 3. When Algorithm 3 converges, since $D_{i}=h^{\prime}_{i}(g_{i}^{T}(x)g_{i}(x)+\delta I)$ according to the first step of Algorithm 3, Eq.(50) is equivalent to

[TABLE]

Therefore, the solution $x$ satisfies Eq.(51) when Algorithm 3 converges, which is exactly the KKT condition of problem (36) in Eq.(47). $\Box$

Theorems 2 and 3 indicate that Algorithm 3 will converge, and will usually converge to a stationary point of problem (36). If problem (36) is convex, Algorithm 3 will usually converge to a global optimum.

It is worth pointing out that a similar algorithm and similar results can also be obtained for the following general problem:

[TABLE]

where $h_{i}(x)$ is an arbitrary concave and differentiable function. In this case, the two steps in Algorithm 3 become $D_{i}=h^{\prime}_{i}(g_{i}(x))$ and $\mathop{\min}\limits_{x\in\mathcal{C}}f(x)+\sum\limits_{i}{tr((g_{i}(x))^{T}D_{i})}$, respectively.

VI Experimental Results

In this section, we conduct a variety of experiments to empirically evaluate the convergence rate as well as the accuracy of our new algorithm.

VI-A Data Description

A total of $16$ publicly available real benchmark data sets are used in our evaluations, including: AR10P, PIX10P, PIE10P [25], ORL10P [26], ALLAML [27], MLLML [28], LUNG [29], Prostate-GE [30], Carcinomas [31, 32], GLIOMA [33], CLL-SUB-111 [34], TOX-171 [35], SMK-CAN-187 [36], Prostate-MS [37], ARCENE and DBWorld. The first four are face image data sets (downloaded from http://featureselection.asu.edu/datasets.php), the next eleven are gene expression data sets, and the last two are life data sets obtained from the UCI Repository [38]. Detailed properties of these 17 data sets are given below.

The AR10P data set records $130$ face images from $10$ different people, with each person contributing $13$ images. The faces are represented by $60\times 40$ pixel images, so the dimensionality of each sample is $2400$. This data set has been used in numerous face recognition experiments.

The PIX10P data set consists of $100$ face images from $5$ male and $5$ female subjects. For each subject, $10$ face images of size $100\times 100$ are included. This is also a well-known data set used in face recognition experiments.

The PIE10P data set was collected by the Robotics Institute of Carnegie Mellon University. It is composed of 210 face images from 10 different people, with 21 faces from each subject. Each face is depicted by a $55\times 44$ image. Like the previous two data sets, it appears frequently in face recognition experiments.

The ORL10P data set is also known as the “AT&T face data sets”. It collects 400 face images from 40 distinct subjects. All images are of size $92\times 112$ pixels, with 256 grey levels per pixel. In our experiments, we use the selected data from the ASU Feature Selection Database, where all 10 classes without glasses are included.

ALLAML data set records 7129 genes (sequences) information from the Affymetrix 6800 chip. It has a total of 72 samples in two classes, ALL and AML, of 47 and 25 samples, respectively.

MLLML data set is downloaded from the Ljubljana A.I. lab website. It contains a subset of human acute lymphoblastic leukemias with a chromosomal translocation involving the mixed-lineage leukemia gene. As the data set shows, the mixed-lineage leukemia (MLL) gene has a clear pattern that separates it from ALL and AML, so this data set has been widely used in classification experiments. It is composed of 72 samples from three classes, which are ALL, AML and MLL. The numbers of samples in these three classes are 24, 28 and 20, respectively. Each sample has 12582 genes.

LUNG data set provides a source for the study of lung cancer. It has 203 samples in five classes, among which there are 139 adenocarcinoma (AD), 17 normal lung (NL), 6 small cell lung cancer (SMCL), 21 squamous cell carcinoma (SQ) as well as 20 pulmonary carcinoid (COID) samples. Each sample has 3312 genes.

Prostate-GE data set records gene information of both prostate cancer and normal patients. It contains 102 samples of two classes, among which there are 52 tumor samples and 50 normal samples, respectively. In our experiment, each sample has 5966 genes.

Carcinomas data set shows the influence of genes on various types of carcinomas. This data set contains 174 samples of 11 classes, which are 26 samples of prostate carcinoma, 8 samples of bladder/ureter carcinoma, 26 samples of breast carcinoma, 23 samples of colorectal carcinoma, 12 samples of gastroesophagus carcinoma, 11 samples of kidney carcinoma, 7 samples of liver carcinoma, 27 samples of ovary carcinoma, 6 samples of pancreas carcinoma, 14 samples of lung adeno-carcinoma and 14 samples of lung squamous cell carcinoma. Each sample contains 9182 genes as features.

GLIOMA data set encompasses 50 samples of four different disease statuses, where there are 14 cancer glioblastomas (CG), 14 noncancer glioblastomas (NG), 7 cancer oligodendrogliomas (CO) and 15 non-cancer oligodendrogliomas (NO) samples, respectively. Each sample is described by 4433 genes.

CLL-SUB-111 data set is composed of microarray gene expression information of 111 samples from 3 classes. It provides array analysis results on chronic lymphocytic leukemia (CLL) patients. Each sample contains 11340 genes.

TOX-171 data set records blood analysis of acute Dengue virus (DENV) patients, which provides references for studies of the molecular mechanisms of DENV infection. This data set consists of 171 samples from 4 classes, where each sample has 5748 features.

SMK-CAN-187 data set provides insights into the study of lung cancer inducement. It records RNA microarray information from 187 samples of two classes, including Bronchial Epithelium of Smokers with Lung Cancer and those without. Each sample has a total of 19993 features.

Prostate-MS data set contains a total of 322 samples from three different classes, which are 69 samples diagnosed as prostate cancer, 190 samples of benign prostate hyperplasia, as well as 63 normal samples showing no evidence of disease. Each sample has 15154 genes.

ARCENE data set provides mass-spectrometric information for both cancer and normal patterns. The size of this data set is 100, where each sample has a total of 10000 attributes. It poses a challenging two-class classification problem with continuous input data.

DBWorld contains 64 emails manually collected from the DBWorld mailing list. These emails are classified into two classes: conference announcements and everything else. Each email is described by 4702 features in a bag-of-words representation.

VI-B Experiments on Solving the Example Problem (33)

In this experiment, we examine the efficiency of our algorithm for solving Problem (33) with different values of $p$ .

This experiment requires four matrices as input, namely $A$ and $Y$, and $B$ and $Z$, where $A$ and $B$ must have the same dimensionality, as must $Y$ and $Z$. It is difficult to find real benchmark data sets with exactly matching dimensionality, while experiments on purely synthetic data are less challenging. Hence we combine benchmark data sets with synthetic data: the matrices $A$ and $Y$ are real benchmark data, while $B$ and $Z$ are synthetic data drawn from a Gaussian distribution, with dimensionality set equal to that of the corresponding real benchmark data.

Since the purpose of this experiment is to show the convergence performance of our method with different $p$ values, here we choose seven disparate $p$ values in the range of $0<p\leq 2$ which are {0.1, 0.5, 0.8, 1, 1.2, 1.5, 2}. We performed experiments on eight data sets with comparatively small dimensionality, which are AR10P, PIE10P, ALLAML, LUNG, Prostate-GE, Carcinomas, GLIOMA, and TOX-171.

The results are displayed in Fig. 1, which shows that our method converges very fast, usually within $50$ iterations. In particular, when $p=2$, our method converges in just one iteration. This is because when $p=2$, Problem (33) becomes:

[TABLE]

In Problem (53), if we take derivative w.r.t. $X$ and set it to zero, we will get:

[TABLE]

which gives $X$ in closed form, so our method converges in just one iteration.
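A short calculation makes the one-iteration behaviour plausible. It assumes, purely as an illustration, that the $\ell_p$-type terms are handled through the concave function $h(u)=u^{p/2}$, the form that yields weights of the type $\frac{p}{2}(\cdot)^{\frac{p-2}{2}}$; the exact form of Problem (33) is not restated here.

$$h(u)=u^{\frac{p}{2}} \;\Longrightarrow\; h^{\prime}(u)=\frac{p}{2}\,u^{\frac{p-2}{2}}, \qquad h^{\prime}(u)\big|_{p=2}=1 .$$

Under this assumption the weights at $p=2$ are constant and do not depend on the current iterate, so the first re-weighted problem already coincides with the original problem and later iterations leave its solution unchanged.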

VI-C Experiments on the Proximal Problem

This subsection considers another complex problem:

[TABLE]

According to the series of works by Yurii Nesterov [39, 40, 41, 42], we can solve Problem (55) via the proximal method. Before describing the solution process, we briefly introduce the proximal method.

For a general minimization problem w.r.t. $x$ as follows:

[TABLE]

We can approximate the function $f(x)$ by its Taylor series:

[TABLE]

where $L=f^{\prime\prime}({x_{t-1}})$ .

Then the original equation in Problem (56) can be rewritten as:

[TABLE]

thus we can update $x_{t}$ in each iteration as the optimal solution to the following problem:

[TABLE]

It has been proven in Yurii Nesterov’s work that if the original problem is convex, the proximal method will reach its global optimum with a convergence rate $O(\frac{1}{t})$ ; otherwise it will end up with a stationary point.
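The proximal update described above can be sketched on a standard convex example that is not the paper's Problem (55): $\min_x \frac{1}{2}\|Ax-y\|_2^2+\lambda\|x\|_1$. Here $L$ is taken to be the Lipschitz constant of the gradient of the smooth part, which is a common choice and an assumption of this sketch; the names `soft_threshold`, `proximal_gradient` and `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||x||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, y, lam=0.1, n_iter=100):
    """Minimize 0.5 * ||Ax - y||^2 + lam * ||x||_1 by proximal gradient steps."""
    L = np.linalg.norm(A, 2) ** 2           # squared spectral norm of A (assumed choice of L)
    x = np.zeros(A.shape[1])
    history = []                            # objective value before each update
    for _ in range(n_iter):
        history.append(0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x)))
        v = x - (A.T @ (A @ x - y)) / L     # gradient step on the smooth part
        x = soft_threshold(v, lam / L)      # proximal step on the non-smooth part
    return x, history
```

The intermediate point `v` plays the same role as $x_{t-1}-\frac{1}{L}f^{\prime}(x_{t-1})$, and the regularization weight is rescaled by $L$ before the proximal step.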

Based on the proximal method introduced above, we can optimize Problem (55) by solving the following problem in each iteration:

[TABLE]

where $V={X_{t-1}}-\frac{1}{L}f^{\prime}({X_{t-1}})$ , $\gamma^{\prime}_{1}=\frac{2\gamma_{1}}{L}$ , $\gamma^{\prime}_{2}=\frac{2\gamma_{2}}{L}$ and $\gamma^{\prime}_{3}=\frac{2\gamma_{3}}{L}$ .

Problem (59) can be solved using our new algorithm, so our goal in this subsection is to check the efficiency of our algorithm on Problem (59).

In this experiment we used all eight data sets from Sect. VI-B and varied $p$ in the set $\{0.1,0.5,0.8,1,1.2,1.5,2\}$. As for the parameters $\gamma^{\prime}_{1}$, $\gamma^{\prime}_{2}$ and $\gamma^{\prime}_{3}$, we simply set them to 1, since this experiment only tests the convergence rate. If the purpose is instead to minimize Problem (59) and find a suitable $X$ for a certain task, tuning $\gamma^{\prime}_{1}$, $\gamma^{\prime}_{2}$ and $\gamma^{\prime}_{3}$ provides a convenient way to improve performance. We present the results on the different data sets in Fig. 2.

Our method again converges very fast, almost always within $20$ iterations. We also observe the special case $p=2$, in which our method converges in just one iteration. The reason is similar to the one above: when $p=2$, Problem (59) has a closed-form solution, which allows our algorithm to converge in a single iteration.

VI-D Experiments on a Robust Feature Selection Problem

In this subsection, we apply our method to a robust feature selection problem. We use our algorithm to solve the following problem so as to find an appropriate weight matrix $W$ with which we can perform efficient feature selection.

[TABLE]

In the above function, $\circ$ is the Hadamard product, and $M$ is defined as $M=\max(({X^{T}}W+1{b^{T}}-Y)\circ Y,0)$. Through the matrix $M$, the classifier discards any positive loss incurred by correctly classified points. This makes our method more robust than related ridge regression classification methods, since a correctly classified point does not contribute loss to the objective.
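The effect of $M$ can be checked numerically with a small sketch. It assumes that $X$ is stored as a $d\times n$ matrix, $W$ as $d\times c$, $b$ as a length-$c$ vector, and that the entries of $Y$ are $\pm 1$; these conventions and the function name `margin_matrix` are assumptions made for illustration. The residual formed here is the quantity $X^{T}W+1b^{T}-Y-Y\circ M$.

```python
import numpy as np

def margin_matrix(X, W, b, Y):
    """Return M = max((X^T W + 1 b^T - Y) o Y, 0) and the residual it leaves."""
    pred = X.T @ W + b[None, :]           # n x c matrix of predictions
    M = np.maximum((pred - Y) * Y, 0.0)   # Hadamard product, then elementwise max with 0
    residual = pred - Y - Y * M           # zero wherever the prediction overshoots the label
    return M, residual
```

For a sample with label $+1$ and prediction $2$, the entry of $M$ is $1$ and the residual is $0$, so a point classified correctly with a margin adds nothing to the loss. A sample with label $+1$ and prediction $0.5$ keeps its residual of $-0.5$.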

Before showing experimental results, we first briefly summarize the solving process of Problem (60). Resorting to Algorithm 1, Problem (60) can be rewritten as an easily solvable form as below:

[TABLE]

where $D_{1}$ is a diagonal matrix whose $k$-th diagonal element is $\frac{1}{2}(({X^{T}}w^{k}+({b^{T}}-y^{k}-y^{k}\circ m^{k}))^{T}({X^{T}}w^{k}+({b^{T}}-y^{k}-y^{k}\circ m^{k}))+\delta)^{-\frac{1}{2}}$, and $D_{2}$ is a diagonal matrix whose $k$-th diagonal element is $\frac{p}{2}((w^{k})^{T}w^{k}+\delta)^{\frac{p-2}{2}}$.

Similarly, we can solve Problem (61) via the alternating optimization method. Taking the derivative of Problem (61) w.r.t. $w_{i}$, we get:
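The $D_{2}$ weights are simple to compute. The sketch below assumes that $w^{k}$ denotes the $k$-th row of $W$, which is the usual convention for row-sparse feature selection but is an assumption here; the function name `row_weights` is hypothetical.

```python
import numpy as np

def row_weights(W, p=1.0, delta=1e-8):
    """Diagonal of D_2: (p/2) * (||w^k||_2^2 + delta)^((p-2)/2) for each row w^k of W."""
    sq_norms = np.sum(W ** 2, axis=1)     # (w^k)^T w^k for every row
    return (p / 2) * (sq_norms + delta) ** ((p - 2) / 2)
```

For $p<2$ a row with a small norm receives a large weight, which pushes that row further towards zero in the next re-weighted problem, while $\delta$ keeps the weight finite when a row is exactly zero.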

[TABLE]

That is,

[TABLE]

Taking the derivative of Problem (61) w.r.t. $b$, we get:

[TABLE]

The optimal solution of $b$ is:

[TABLE]

For this problem, we update the variables $W$, $D$, $b$ and $M$ alternately and iteratively until convergence.

In this test, we used all $16$ data sets. For each data set, we employed 5-fold cross validation, in which 80% of the data is randomly selected for training and the remaining $20\%$ for testing. We used an SVM classifier with a linear kernel and $C=1$. The number of selected features ranges from 10 to 100 in steps of 10. We compared our proposed feature selection method with several widely used feature selection methods: Fisher Score [43], Information Gain [44], ReliefF [45, 46], T-test and ChiSquare [47].

For our method, we report the performance for four different $p$ values, $\{0.1,0.3,0.5,1\}$. We did not use $p$ values larger than $1$ because we want to guarantee the sparsity of the matrix $W$. Also, since the previous two experiments show that our method converges very fast, we set the number of iterations to $30$. The methods are evaluated by average classification accuracy, which is summarized in Fig. 3 and Fig. 4.

Fig. 3 and Fig. 4 confirm the effectiveness of our proposed method on real benchmark data sets. Our method generally outperforms the traditional methods across these varied data sets, for every $p$ value tested. Its efficiency has been discussed and demonstrated above. Overall, our algorithm is capable of finding a promising classification matrix that is more robust to outliers, and it converges quickly.
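The evaluation protocol can be sketched with two small helpers. The fold construction follows the 80%/20% split described above. Ranking features by the $\ell_2$ norm of the rows of $W$ is an assumption of this sketch, since the selection rule is not spelled out in this section, and the names `kfold_indices`, `top_k_features` and `feature_counts` are hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out about 1/n_folds of the data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

def top_k_features(W, k):
    """Indices of the k rows of W with the largest l2 norm (assumed ranking rule)."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(scores)[::-1][:k]

feature_counts = list(range(10, 101, 10))   # 10, 20, ..., 100 selected features
```

A full run would train the linear SVM on the selected features of each training fold and average the test accuracy over the five folds for every entry of `feature_counts`.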

VII Conclusions

The loss function and the regularizer are two significant factors influencing the performance of an algorithm, and sparsity-inducing norms are commonly involved in both. To solve complicated problems with sparsity-inducing norms, in this work we provide a simple yet efficient optimization method that can handle the case in which both the loss function and the regularizer are non-convex. The proposed method is suitable for various tasks.

Two issues of IRW remain: 1) theoretically, only stationary points are guaranteed for non-convex problems; 2) practically, IRW is not efficient for problems with multiple inseparable variables, because a closed-form solution cannot be directly obtained when the surrogate function has multiple inseparable variables. Addressing these two issues is the focus of our future work.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China grant under numbers 61772427 and 61751202.

Bibliography

[1] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization,” in Advances in neural information processing systems, 2010, pp. 1813–1821.
[2] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 2790–2797.
[3] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 171–184, 2013.
[4] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature learning,” Machine Learning, vol. 73, no. 3, pp. 243–272, 2008.
[5] Q. Yao and J. T. Kwok, “Efficient learning with a family of nonconvex regularizers by redistributing nonconvexity,” Journal of Machine Learning Research, vol. 18, pp. 179–1, 2017.
[6] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye, “A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems,” in International Conference on Machine Learning, 2013, pp. 37–45.
[7] K. Mohan and M. Fazel, “Iterative reweighted algorithms for matrix rank minimization,” Journal of Machine Learning Research, vol. 13, no. Nov, pp. 3441–3473, 2012.
[8] F. Nie, Z. Huo, and H. Huang, “Joint capped norms minimization for robust matrix recovery,” in The 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), 2017.
