Adaptive Locality Preserving Regression

Jie Wen; Zuofeng Zhong; Zheng Zhang; Lunke Fei; Zhihui Lai; Runze Chen

arXiv:1901.00563·cs.CV·January 4, 2019

TL;DR

This paper introduces ALPR, a flexible discriminative regression method that preserves data structure, performs feature selection, and enhances interpretability for improved classification accuracy.

Contribution

It proposes a novel adaptive target learning and locality preserving constraint with l21 norm regularization for feature selection and interpretability.

Findings

01

Effective in synthetic and real-world datasets

02

Preserves data structure and enhances discriminative power

03

Selects important features and reduces noise

Abstract

This paper proposes a novel discriminative regression method, called adaptive locality preserving regression (ALPR) for classification. In particular, ALPR aims to learn a more flexible and discriminative projection that not only preserves the intrinsic structure of data, but also possesses the properties of feature selection and interpretability. To this end, we introduce a target learning technique to adaptively learn a more discriminative and flexible target matrix rather than the pre-defined strict zero-one label matrix for regression. Then a locality preserving constraint regularized by the adaptive learned weights is further introduced to guide the projection learning, which is beneficial to learn a more discriminative projection and avoid overfitting. Moreover, we replace the conventional `Frobenius norm' with the special l21 norm to constrain the projection, which enables the…

Tables

Table 1. Classification accuracies (%) of different methods on the three-ring data. (Note: NC is the abbreviation of the nearest neighbor classifier [54].)

No.	NC	LRC	CRC	SRC	LRLR	LRRR	SLRR	DLSR	ReLSR	SVM	DRLS	MSRL	CLRS	GReLSR	ALPR
Th1	93.13	33.00	33.33	39.40	36.13	35.67	36.13	63.60	63.40	99.20	77.33	66.60	36.07	34.06	99.93
Th2	38.33	41.46	33.33	42.27	31.00	30.93	31.00	41.26	68.13	67.80	37.93	69.27	32.00	33.53	99.87

Table 2. Descriptions of the used real-world databases.

Database	# Sample per class	# Class	# Feature
COIL100	72	100	1024
PIE	164-170	68	1024
LFW	11-20	86	1024
Scene_SPM	210-410	15	3000
CIFAR-10	6000	10	1000

Table 3. Mean classification accuracies (%) of different methods on the COIL100 database. Note: (1) bold numbers denote the best results; (2) we directly list the results of MSRL reported in [22].

No.	LRC	CRC	SRC	SVM	LRLR	LRRR	SLRR	DLSR	ReLSR	DRLS	MSRL	CLRS	GReLSR	ALPR
10	82.77	77.80	84.30	83.99	55.78	65.86	58.81	82.47	81.66	77.52	88.40	79.29	78.99	87.69
15	88.82	82.31	85.07	89.04	61.04	69.61	64.99	87.55	86.17	81.44	93.32	83.35	83.50	91.72
20	91.82	84.89	87.86	92.12	67.26	72.08	69.04	92.57	89.11	88.15	95.87	85.85	86.23	94.37
25	93.64	86.61	90.67	93.89	72.22	74.73	73.01	93.28	93.23	90.06	97.15	87.96	88.28	95.97

Table 4. Mean classification accuracies (%) of different methods on the PIE database. Note: (1) bold numbers denote the best results; (2) we directly list the results of MSRL reported in [22].

No.	LRC	CRC	SRC	SVM	LRLR	LRRR	SLRR	DLSR	ReLSR	DRLS	MSRL	CLRS	GReLSR	ALPR
10	75.16	86.33	72.48	77.87	73.06	86.67	86.88	82.55	87.53	84.70	89.51	89.58	87.22	91.14
15	84.60	90.86	82.62	86.44	80.26	89.99	90.25	89.34	91.89	89.40	93.39	92.93	91.43	94.49
20	89.62	92.98	85.66	92.65	82.27	91.96	92.57	92.28	93.89	92.32	95.02	94.50	93.48	96.00
25	91.87	93.94	89.90	93.74	88.22	93.45	93.88	94.16	95.19	93.82	95.96	95.39	94.71	96.67

Table 5. Mean classification accuracies (%) of different methods on the LFW database. Note: bold numbers denote the best results.

No.	LRC	CRC	SRC	SVM	LRLR	LRRR	SLRR	DLSR	ReLSR	DRLS	MSRL	CLRS	GReLSR	ALPR
5	29.73	30.12	29.38	26.04	30.24	33.37	30.57	27.90	31.81	26.26	32.34	36.91	37.31	37.39
6	32.18	31.44	32.51	29.52	33.29	35.24	34.15	30.80	34.45	28.07	35.68	40.48	40.10	41.39
7	34.53	32.51	33.64	30.60	34.96	35.59	34.36	33.73	37.70	33.97	38.45	41.91	42.72	43.27
8	37.23	34.55	35.12	33.14	35.59	36.52	35.64	36.80	40.37	34.52	42.58	44.39	44.55	45.93

Table 6. Mean classification accuracies (%) of different methods on the Scene_SPM database. Note: bold numbers denote the best results.

No.	LRC	CRC	SRC	SVM	LRLR	LRRR	SLRR	DLSR	ReLSR	DRLS	MSRL	CLRS	GReLSR	ALPR
10	87.75	87.64	87.60	85.09	81.08	86.02	84.44	87.77	88.04	86.98	88.86	89.41	89.84	90.86
20	92.21	92.02	91.99	91.30	89.49	88.24	89.53	91.49	92.04	93.53	93.60	94.06	94.21	95.25
30	93.64	94.02	92.89	92.90	86.59	87.72	89.75	93.50	93.36	94.70	95.44	95.75	95.83	96.66
40	94.97	94.64	95.49	93.43	91.38	90.34	91.07	94.22	95.79	95.21	96.52	96.71	96.90	97.62

Table 7. Classification accuracies (ACC) (%) of different methods on the K-means-CIFAR10 database. For the four deep learning based methods, we directly list their reported results.

Method	ACC	Method	ACC	Method	ACC
LRC	58.87	DLSR	67.15	GReLSR	70.49
CRC	56.35	ReLSR	64.82	ALPR	72.37
SRC	54.67	SVM	71.41	ResNet	93.57
LRLR	65.14	DRLS	66.95	SFC	92.19
LRRR	65.21	MSRL	70.83	DeepLDA	92.71
SLRR	65.14	CLRS	70.12	DenseNet	94.81

Equations

$$\min_{W} \left\|Y - X^{T}W\right\|_{F}^{2} + \lambda\left\|W\right\|_{F}^{2}$$

$$\min_{W,T} \left\|T - W^{T}X\right\|_{F}^{2} + \lambda\left\|W\right\|_{F}^{2} \quad \text{s.t.}\ T_{i,l_{i}} - \max_{j\neq l_{i}} T_{i,j} \geq 1$$

$$\min_{W} \frac{1}{2}\left\|Y - X^{T}W\right\|_{F}^{2} + \frac{1}{2}W^{T}X\left(\eta L_{w} - (1-\eta)L_{b}\right)X^{T}W$$

$$W_{i,j}^{w} = \begin{cases} 1, & \text{if } X_{:,j}\in N_{w}\left(X_{:,i}\right) \text{ or } X_{:,i}\in N_{w}\left(X_{:,j}\right) \\ 0, & \text{otherwise} \end{cases}$$

$$W_{i,j}^{b} = \begin{cases} 1, & \text{if } X_{:,j}\in N_{b}\left(X_{:,i}\right) \text{ or } X_{:,i}\in N_{b}\left(X_{:,j}\right) \\ 0, & \text{otherwise} \end{cases}$$

$$\min_{W,S} \lambda_{1}\sum_{i=1}^{C}\sum_{j,k=1,\,k\neq j}^{n_{i}} \left(S_{j,k}^{i}\right)^{2}\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2} + \left\|Y - X^{T}W\right\|_{F}^{2} \quad \text{s.t.}\ \sum_{k=1,\,k\neq j}^{n_{i}} S_{j,k}^{i} = 1,\ S_{j,k}^{i}\geq 0$$

$$\min_{W,S,T} \left\|T - X^{T}W\right\|_{F}^{2} + \lambda_{1}\sum_{i=1}^{C}\sum_{k,j=1,\,k\neq j}^{n_{i}} \left(S_{j,k}^{i}\right)^{2}\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2} \quad \text{s.t.}\ T_{i,l_{i}} - \max_{j\neq l_{i}} T_{i,j} \geq 1,\ \sum_{k=1,\,k\neq j}^{n_{i}} S_{j,k}^{i} = 1,\ S_{j,k}^{i}\geq 0$$

$$\min_{W,S,T} \left\|T - X^{T}W\right\|_{F}^{2} + \lambda_{2}\left\|W\right\|_{2,1} + \lambda_{1}\sum_{i=1}^{C}\sum_{k,j=1,\,k\neq j}^{n_{i}} \left(S_{j,k}^{i}\right)^{2}\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2} \quad \text{s.t.}\ T_{i,l_{i}} - \max_{j\neq l_{i}} T_{i,j} \geq 1,\ \sum_{k=1,\,k\neq j}^{n_{i}} S_{j,k}^{i} = 1,\ S_{j,k}^{i}\geq 0$$

$$L(W) = \left\|T - X^{T}W\right\|_{F}^{2} + \lambda_{2}\left\|W\right\|_{2,1} + \lambda_{1}\sum_{i=1}^{C}\sum_{k,j=1,\,k\neq j}^{n_{i}} \left(S_{j,k}^{i}\right)^{2}\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2}$$

$$L(W) = \left\|T - X^{T}W\right\|_{F}^{2} + \lambda_{1}Tr\left(W^{T}S_{W}W\right) + \lambda_{2}\left\|W\right\|_{2,1}$$

$$X\left(X^{T}W - T\right) + \lambda_{1}S_{W}W + \frac{\lambda_{2}}{2}DW = 0 \;\Rightarrow\; W = \left(XX^{T} + \lambda_{1}S_{W} + \frac{\lambda_{2}}{2}D\right)^{-1}XT$$

$$\min_{S} \sum_{i=1}^{C}\sum_{k,j=1,\,k\neq j}^{n_{i}} \left(S_{j,k}^{i}\right)^{2}\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2} \quad \text{s.t.}\ \sum_{k=1,\,k\neq j}^{n_{i}} S_{j,k}^{i} = 1,\ S_{j,k}^{i}\geq 0$$

$$\min_{\sum_{k=1,\,k\neq j}^{n_{i}} S_{j,k}^{i} = 1,\ S_{j,k}^{i}\geq 0} \ \sum_{k=1,\,k\neq j}^{n_{i}} \left(S_{j,k}^{i}\right)^{2}\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2}$$

$$S_{j,k}^{i} = \frac{1}{\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2}}\left(\sum_{p=1,\,p\neq j}^{n_{i}} \frac{1}{\left\|W^{T}x_{j}^{i} - W^{T}x_{p}^{i}\right\|_{2}^{2}}\right)^{-1}$$

$$\min_{T_{i,l_{i}} - \max_{j\neq l_{i}} T_{i,j} \geq 1} \left\|T - X^{T}W\right\|_{F}^{2}$$

$$t_{i} = \begin{cases} g_{i} + \Delta, & \text{if } i = h \\ g_{i} + \min\left(\Delta - v_{i}, 0\right), & \text{otherwise} \end{cases}$$

$$\min_{T_{i,l_{i}} - \max_{j\neq l_{i}} T_{i,j} \geq 1} \left\|T_{i,:} - G_{i,:}\right\|_{2}^{2}$$

$$L\left(W^{t+1}, S^{t}, T^{t}\right) \leq L\left(W^{t}, S^{t}, T^{t}\right)$$

$$L\left(W^{t+1}, S^{t+1}, T^{t}\right) \leq L\left(W^{t+1}, S^{t}, T^{t}\right)$$

$$L\left(W^{t+1}, S^{t+1}, T^{t+1}\right) \leq L\left(W^{t+1}, S^{t+1}, T^{t}\right)$$

$$L\left(W^{t+1}, S^{t+1}, T^{t+1}\right) \leq L\left(W^{t}, S^{t}, T^{t}\right)$$

$$\min_{W,A,B,T,P} \left\|T - W^{T}X\right\|_{F}^{2} + \alpha\left\|W\right\|_{F}^{2} + \beta\left\|W - AB\right\|_{F}^{2} + \lambda\sum_{i,j}^{n}\left(\left\|W^{T}x_{i} - W^{T}x_{j}\right\|_{2}^{2}P_{i,j} + \sigma P_{i,j}^{2}\right) \quad \text{s.t.}\ T_{i,l_{i}} - \max_{j\neq l_{i}} T_{i,j} \geq 1,\ A^{T}A = I,\ 0\leq P_{i,j}\leq 1,\ P\mathbf{1} = \mathbf{1}$$

$$\min_{rank(W)\leq s} \left\|Y - X^{T}W\right\|_{F}^{2} + \lambda\left\|W\right\|_{2,1}$$

$$\min_{W,\,M\geq 0} \left\|Y - B\odot M - X^{T}W\right\|_{2,1} + \lambda\left\|W\right\|_{2,1}$$

$$B_{i,j} = \begin{cases} 1, & \text{if } l_{i} = j \\ -1, & \text{otherwise} \end{cases}$$

Full text

Adaptive Locality Preserving Regression

Jie Wen, Zuofeng Zhong, Zheng Zhang, Lunke Fei∗, Zhihui Lai, Runze Chen

This work was supported in part by the National Natural Science Foundation of China under Grant nos. 61702110, 61702117 and 61703169, and in part by Technology Program of Guangzhou under Grant no. 201804010355. (Jie Wen and Zuofeng Zhong are co-first authors with equal contributions.) (Corresponding author: Lunke Fei.)

Jie Wen and Lunke Fei are with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, Guangdong, China. (Email: [email protected]; [email protected])

Zuofeng Zhong is with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518055, Guangdong, China, and is also with the Institute of Textiles and Clothing, The Hong Kong Polytechnic University, Hong Kong. (Email: [email protected])

Zheng Zhang is with the School of Information Technology & Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia. (Email: [email protected])

Zhihui Lai is with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518055, Guangdong, China. (Email: [email protected])

Runze Chen is with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, China. (Email: [email protected])

Abstract

This paper proposes a novel discriminative regression method, called adaptive locality preserving regression (ALPR), for classification. In particular, ALPR aims to learn a more flexible and discriminative projection that not only preserves the intrinsic structure of data, but also possesses the properties of feature selection and interpretability. To this end, we introduce a target learning technique to adaptively learn a more discriminative and flexible target matrix rather than the pre-defined strict zero-one label matrix for regression. Then a locality preserving constraint regularized by adaptively learned weights is further introduced to guide the projection learning, which helps to learn a more discriminative projection and avoid overfitting. Moreover, we replace the conventional ‘Frobenius norm’ with the $l_{2,1}$ norm to constrain the projection, which enables the method to adaptively select the most important features from the original high-dimensional data for feature extraction. In this way, the negative influence of the redundant features and noise residing in the original data can be greatly reduced. Besides, the proposed method has good interpretability for features owing to the row-sparsity property of the $l_{2,1}$ norm. Extensive experiments conducted on a synthetic database with manifold structure and on many real-world databases demonstrate the effectiveness of the proposed method.

Index Terms:

Linear regression, projection learning, adaptive locality preserving, supervised graph regularization.

I Introduction

Regression analysis focuses on estimating the relationships between dependent and independent variables, and has attracted much attention in the field of machine learning [1, 2, 3, 4, 5, 6, 7, 8]. For supervised classification, one of the major tasks is to learn a proper mapping that precisely transforms the training data into their labels. To this end, various regression analysis methods have been proposed over the past decades, such as ridge regression [9, 10], partial least squares [11], modified minimum squared error [12], and least square regularized regression [13, 14]. Besides, many kernel based regression methods, such as kernel ridge regression [15] and support vector regression [16], have also been proposed for the non-separable cases.

Most of the conventional methods exploit the pre-defined zero-one label matrix as the regression target. However, this simple label matrix is not the optimal discriminative target for supervised classification [17, 18, 19, 20, 21, 22]. First, it limits the flexibility of projection learning because it is too strict. Second, it cannot push samples of different classes far apart, because the distance between the label vectors of any two different classes is constant (i.e., $\sqrt{2}$) in the target space. To solve the problem, many researchers proposed to learn a more discriminative regression target rather than use the strict zero-one label matrix for regression analysis. For example, in [17], the $\varepsilon$-dragging technique is introduced to adaptively enlarge the distance between the correct and incorrect classes. In [18], a sparse error term is introduced to relax the strict label matrix, which in turn improves the flexibility of regression analysis. Zhang et al. proposed a novel target learning technique, in which the target matrix is adaptively learned with large margins between different classes during the regression and projection learning [23].

In most cases, these improved regression based methods can learn a more discriminative projection and obtain better performance. However, for data with manifold structure or noise, these methods tend to fail. This is mainly because (1) these methods only focus on minimizing the regression errors while ignoring the intrinsic geometric structure of data, which may destroy the structure of the original data and lead to overfitting; and (2) all features, including the important features and noise, are treated equally in these methods, so a clean projection cannot be guaranteed.

In recent years, many researchers have recognized the first issue and have made many efforts to address it. For instance, the low-rank linear regression (LRLR) tries to uncover the low-rank structure hidden in the high-dimensional data for regression [24]. In [18], a novel inter-class sparse constraint is introduced, which tries to guarantee the common structure with respect to each class. Besides, integrating manifold learning into the regression framework is a well-received approach to avoid overfitting [25, 26]. Xue et al. constructed two graphs according to the label information and local nearest neighbor information to guide the projection learning in regression analysis [25]. In [26], a strict label based graph is introduced to regularize the projection, which allows samples of the same class to be pulled together. Compared with the first two methods mentioned above, these manifold learning based methods can preserve the local structure of data better. However, they still have the following shortcomings: (1) All graphs exploited to guide the projection learning are constructed in advance, independently of the regression, which cannot guarantee the globally optimal projection. (2) These methods are sensitive to noise. When data contain noise, the constructed graph will be incorrect, and it is then difficult to learn a discriminative projection under the guidance of the incorrect graph. (3) These methods cannot preserve the same nearest neighbor ranks as the original data, since they exploit the same weight, i.e., 1, to regularize all nearest neighbors while ignoring the differences in similarity among these nearest neighbor pairs. In other words, they cannot preserve the intrinsic nearest neighbor structure of data.

In this paper, we propose a novel and effective method to solve the above problems and learn a more discriminative projection for classification. Specifically, the proposed method introduces a novel graph regularization term into the regression framework, in which the graph is adaptively learned in a supervised style and then in turn guides the projection learning. In this way, the proposed method has the potential to learn the globally optimal projection that not only fits the labels well, but also preserves the intrinsic nearest neighbor structure of each class. To improve the flexibility of projection learning, the retargeted learning technique is introduced into our model. Most importantly, a row-sparsity norm instead of the conventional ‘Frobenius’ norm is introduced to constrain the projection, which enables the method to reduce the negative influence of the redundant features and noise. By integrating the above terms into one regression framework, the proposed method is encouraged to perform better. In summary, our work has the following contributions:

(1) We propose a novel regression framework that integrates the adaptive locality preserving, feature selection, and discriminative target learning. Compared with the other methods, the proposed method can learn a more reasonable and discriminative projection and avoid the overfitting problem.

(2) We introduce a novel supervised graph learning and embedding constraint, which can discover the intrinsic local geometric structures of data and adaptively learn the similarity weights to nearest neighbor pairs. By exploiting the reliable local geometric information to regularize the projection, the proposed method has the potential to preserve the intrinsic nearest neighbor structure of data.

(3) The proposed method can adaptively select the most discriminative features for regression by introducing a row-sparsity constraint. This allows the method to reduce the negative influence of noise and improves the interpretability of projection.

This paper is an extended work of our conference paper [27]. Compared with the previous version in [27], (1) we add more experiments and analyses to demonstrate its effectiveness and convergence; (2) we analyze the computational complexity and convergence in depth, and demonstrate its superior properties through theoretical comparison with some related methods; (3) we analyze the parameter selection in detail; (4) some theorems and propositions are provided to help readers better understand the paper; (5) a figure has been added to illustrate our method.

The paper is organized as follows: In Section II, some notations and several related works are briefly described. In Section III, we mainly present the proposed method and its optimization processes. Section IV analyzes the proposed method in depth. Section V conducts several experiments to prove the effectiveness of the proposed method. Section VI offers the conclusion of the paper.

II Related work

II-A Notations

For convenience, some notations used throughout the paper are briefly described in this section. In our paper, matrices and vectors are denoted by uppercase letters (e.g., $X$) and lowercase letters (e.g., $x$), respectively. For a matrix $X$, we use $X_{i,j}$ to denote its $i$th row and $j$th column element, and use $X_{i,:}$ and $X_{:,j}$ to represent its $i$th row vector and $j$th column vector, respectively. Some typical norms of matrix $X\in R^{m\times n}$, such as the ‘Frobenius norm’ (i.e., $||X||_{F}$), nuclear norm (i.e., $||X||_{*}$), $l_{1}$ norm, and $l_{2,1}$ norm, are defined as: ${\left\|X\right\|_{F}}=\sqrt{\sum\limits_{i=1}^{m}{\sum\limits_{j=1}^{n}{X_{i,j}^{2}}}}$, ${\left\|X\right\|_{*}}=\sum\nolimits_{i}{\left|{{\delta_{i}}}\right|}$, ${\left\|X\right\|_{1}}=\sum\limits_{i=1}^{m}{\sum\limits_{j=1}^{n}{\left|{{X_{i,j}}}\right|}}$, and ${\left\|X\right\|_{2,1}}=\sum\limits_{i=1}^{m}{\sqrt{\sum\limits_{j=1}^{n}{X_{i,j}^{2}}}}$, respectively, where $\delta_{i}$ is the $i$th singular value of matrix $X$ [28, 29, 30, 31]. For a vector $x$ with $m$ elements, its $l_{2}$ norm is defined as ${\left\|x\right\|_{2}}=\sqrt{\sum\limits_{i=1}^{m}{{(x_{i})}^{2}}}$, where $x_{i}$ denotes its $i$th element. The trace operation of a matrix is denoted by $Tr(\cdot)$. $I$ denotes the identity matrix. 1 is a column vector in which all elements are 1. $X^{-1}$ and $X^{T}$ are the inverse matrix and transposed matrix of $X$ [32], respectively.
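The four matrix norms defined above can be checked numerically. The following is a minimal NumPy sketch, not code from the paper; the function name `matrix_norms` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def matrix_norms(X):
    """Return the Frobenius, nuclear, l1 and l2,1 norms of a 2-D array."""
    fro = np.sqrt(np.sum(X ** 2))
    # Nuclear norm: sum of the singular values of X.
    nuc = np.sum(np.linalg.svd(X, compute_uv=False))
    l1 = np.sum(np.abs(X))
    # l2,1 norm: sum over rows of each row's l2 norm.
    l21 = np.sum(np.sqrt(np.sum(X ** 2, axis=1)))
    return fro, nuc, l1, l21

# Toy matrix (illustrative only): the second row is all zeros.
X_toy = np.array([[3.0, 4.0],
                  [0.0, 0.0]])
fro, nuc, l1, l21 = matrix_norms(X_toy)
```

For this toy matrix the $l_{2,1}$ norm is 5 (only the first row contributes), whereas the $l_{1}$ norm is 7: the $l_{2,1}$ norm penalizes whole rows rather than individual entries.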

II-B Linear regression and retargeted least square regression

Linear regression (LR) is one of the most popular supervised classification methods in the field of machine learning. The objective function of LR is generally formulated as follows [33, 34]:

$$\min_{W} \left\|Y - X^{T}W\right\|_{F}^{2} + \lambda\left\|W\right\|_{F}^{2} \qquad (1)$$

where matrix $X\in R^{m\times n}$ denotes the training set, in which each column vector represents a sample, and $m$ and $n$ denote the feature dimension and number of training samples, respectively. $Y\in R^{n\times C}$ is the label matrix, in which the $i$th row vector represents the label of the $i$th sample in the training set $X$, and $C$ is the number of classes in the training set. In the conventional LR, label matrix $Y$ is generally defined as a special zero-one matrix according to the class information of samples as follows: if the $i$th sample comes from the $j$th class, then only $Y_{i,j}=1$, and all the other elements $Y_{i,k}=0,k\neq j$. $\lambda$ is a penalty parameter. $W$ is the transformation (or projection) for label prediction. Once the projection $W$ is obtained by solving (1), the class label of any test sample $y\in{R^{m\times 1}}$ can be predicted via $k=\mathop{\arg\max}\limits_{i}{\left({{y^{T}W}}\right)_{i}}$, where ${\left({{y^{T}W}}\right)_{i}}$ is the $i$th element of vector $(y^{T}W)$.
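To make the LR pipeline concrete, the sketch below fits $W$ and applies the arg-max prediction rule. It assumes the standard closed-form minimizer obtained by setting the gradient of the LR objective to zero, $W=(XX^{T}+\lambda I)^{-1}XY$; the function names, the 0-based class indices, and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Build the zero-one label matrix Y (n x C) from 0-based class indices."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def fit_linear_regression(X, Y, lam):
    """Solve min_W ||Y - X^T W||_F^2 + lam * ||W||_F^2 in closed form.

    X is m x n with one sample per column, as in the text.
    """
    m = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(m), X @ Y)

def predict(W, y):
    """Return the (0-based) class index maximizing the entries of y^T W."""
    return int(np.argmax(y @ W))

# Toy data (illustrative only): two features, four samples, two classes.
X_train = np.array([[1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
labels = np.array([0, 0, 1, 1])
W = fit_linear_regression(X_train, one_hot(labels, 2), lam=0.1)
```

Calling `predict(W, np.array([1.0, 0.0]))` on this toy problem returns class 0, because the first feature is active only for samples of class 0.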

In [23], Zhang et al. pointed out that using the strict zero-one label matrix as the regression target is harmful to classification and limits the flexibility in the discriminative projection learning. To address this issue, Zhang et al. proposed the retargeted least square regression (ReLSR), which seeks to jointly learn a more discriminative and flexible regression target and projection in one framework as follows:

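$$\min_{W,T} \left\|T - W^{T}X\right\|_{F}^{2} + \lambda\left\|W\right\|_{F}^{2} \quad \text{s.t.}\ T_{i,l_{i}} - \max_{j\neq l_{i}} T_{i,j} \geq 1 \qquad (2)$$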

where $l_{i}$ is the true class index of the $i$ th sample.

We can see that in ReLSR, the margin between the correct class and every incorrect class is at least 1, which increases the separability of data in the target space. Moreover, the adaptively learned target matrix provides more flexibility to learn the discriminative projection.
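The margin constraint $T_{i,l_{i}}-\max_{j\neq l_{i}}T_{i,j}\geq 1$ is easy to verify for a given target matrix. The sketch below only checks the constraint and is not the ReLSR solver; the function names and toy matrices are illustrative assumptions, and class indices are 0-based.

```python
import numpy as np

def margins(T, labels):
    """Return T[i, l_i] - max_{j != l_i} T[i, j] for every row i."""
    T = np.asarray(T, dtype=float)
    rows = np.arange(T.shape[0])
    true_scores = T[rows, labels]
    others = T.copy()
    others[rows, labels] = -np.inf  # exclude the true class from the max
    return true_scores - others.max(axis=1)

def satisfies_margin(T, labels):
    """Check the retargeting constraint (margin of at least 1) on all rows."""
    return bool(np.all(margins(T, labels) >= 1.0))

labels = np.array([0, 1])
Y_strict = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])   # zero-one labels: margin exactly 1
T_bad = np.array([[2.0, 0.5, 0.0],
                  [0.2, 0.9, 0.0]])      # second row has margin 0.7
```

The strict zero-one matrix satisfies the constraint with equality, so it is one feasible target among many; the retargeting idea is to search the whole feasible set for a target that is easier to regress onto.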

II-C Discriminatively regularized least-squares

To pull samples of the same class closer and push samples of different classes far away as much as possible, Xue et al. [25] proposed a graph regularized method, named discriminatively regularized least-squares (DRLS). DRLS explores the underlying geometric knowledge of the original data to guide the projection learning of linear regression, in which two graphs, i.e., intra-class graph $W^{w}$ and inter-class graph $W^{b}$ , are pre-constructed and regularized on the projection. The objective function of DRLS is formulated as follows:

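$$\min_{W} \frac{1}{2}\left\|Y - X^{T}W\right\|_{F}^{2} + \frac{1}{2}W^{T}X\left(\eta L_{w} - (1-\eta)L_{b}\right)X^{T}W \qquad (3)$$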

where $\eta$ ($\eta\in[0,1]$) is a penalty parameter, and $L_{w}$ and $L_{b}$ are the Laplacian matrices of graphs $W^{w}$ and $W^{b}$, respectively. For a non-negative graph $S$, its Laplacian matrix is defined as $L=D-\frac{S+S^{T}}{2}$, where matrix $D$ is a diagonal matrix with the $i$th diagonal element ${D_{i,i}}=\sum\nolimits_{j=1}^{n}{{\frac{S_{i,j}+S_{j,i}}{2}}}$ [25]. In DRLS, the intra-class graph $W^{w}$ and inter-class graph $W^{b}$ are respectively constructed as follows according to the nearest neighbor information and label information of data:

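$$W_{i,j}^{w} = \begin{cases} 1, & \text{if } X_{:,j}\in N_{w}\left(X_{:,i}\right) \text{ or } X_{:,i}\in N_{w}\left(X_{:,j}\right) \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

$$W_{i,j}^{b} = \begin{cases} 1, & \text{if } X_{:,j}\in N_{b}\left(X_{:,i}\right) \text{ or } X_{:,i}\in N_{b}\left(X_{:,j}\right) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$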

where ${N_{w}}\left({{X_{:,i}}}\right)$ denotes the sample set which is composed of the nearest neighbors with the same class to sample $X_{:,i}$ , ${N_{b}}\left({{X_{:,i}}}\right)$ denotes the sample set which is composed of the nearest neighbors with different classes to sample $X_{:,i}$ .
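The construction of the two DRLS graphs and their Laplacians can be sketched as follows. This is a minimal illustration under stated assumptions: the neighborhood size `k` and the use of squared Euclidean distance are assumptions (this section does not fix them), and the function names are hypothetical.

```python
import numpy as np

def build_drls_graphs(X, labels, k=1):
    """Build 0/1 intra-class (Ww) and inter-class (Wb) neighbor graphs.

    X is m x n with one sample per column; k is an assumed neighborhood size.
    """
    labels = np.asarray(labels)
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Pairwise squared Euclidean distances between columns of X.
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    idx_all = np.arange(n)
    for i in range(n):
        same = idx_all[(labels == labels[i]) & (idx_all != i)]
        diff = idx_all[labels != labels[i]]
        for candidates, graph in ((same, Ww), (diff, Wb)):
            nearest = candidates[np.argsort(dist[i, candidates])[:k]]
            # The "or" in the definition makes the graph symmetric.
            graph[i, nearest] = 1.0
            graph[nearest, i] = 1.0
    return Ww, Wb

def laplacian(S):
    """L = D - (S + S^T)/2, with D the diagonal matrix of row sums."""
    A = (S + S.T) / 2.0
    return np.diag(A.sum(axis=1)) - A

# Toy data (illustrative only): four 1-D samples in two well-separated classes.
X_toy = np.array([[0.0, 1.0, 10.0, 11.0]])
Ww, Wb = build_drls_graphs(X_toy, [0, 0, 1, 1], k=1)
Lw, Lb = laplacian(Ww), laplacian(Wb)
```

Every nonzero edge receives the same weight 1 regardless of how close the two samples are, which is the uniform weighting that the proposed method later replaces with learned weights.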

III The proposed method

As we all know, the original data contain much useful information, such as the given label information and the underlying geometric information residing in the data [35, 36, 37, 38]. If we can appropriately utilize this information to guide the projection learning, then a better classification performance can be obtained. Discovering and exploiting this information well is therefore crucial to learning a more compact and discriminative projection that accurately separates the samples. Based on this motivation, we propose a simple but effective method, named adaptive locality preserving regression (ALPR), to learn a more discriminative and compact projection for classification. Fig. 1 shows the flowchart of the proposed method.

III-A Model of the proposed method

Generally, if two samples are nearest neighbors in the original data space, they are more likely to come from the same class. Naturally, we also expect these nearest neighbor relationships to be preserved in the subspace [39]. To this end, many methods have been proposed, such as locality preserving projections [40] and anchorgraph-based locality preserving projection [41]. Besides, in the branch of LR, many improved methods have also been proposed by introducing the nearest neighbor information [25, 26]. For instance, DRLS expects the nearest neighbor samples with the same class label to be pulled closer by embedding two graphs [25]. A large number of experiments have shown that introducing the nearest neighbor information has the potential to further improve the classification performance. However, there are two issues in these conventional LR methods. First, they generally utilize a pre-constructed graph to guide the projection learning, which cannot guarantee the globally optimal projection for regression. Second, they ignore the differences in similarity among the nearest neighbor samples with the same class label, since all nearest neighbor pairs are regularized with the same weight (e.g., 1). This can destroy the original nearest neighbor order in each local space. For example, suppose $x_{2}$ and $x_{3}$ are the 1st and 2nd nearest neighbors of sample $x_{1}$, respectively. For the conventional graph regularized methods, their transformed samples, i.e., $W^{T}x_{2}$ and $W^{T}x_{3}$, may no longer be the 1st and 2nd nearest neighbors of sample $W^{T}x_{1}$ in the discriminant subspace, since their regularization weights are the same. Therefore, using a fixed weight to regularize all nearest neighbor pairs is not appropriate. A more reasonable approach is to assign different weights to different nearest neighbor pairs according to their degrees of similarity. In other words, it is better to give a relatively larger weight to the closest nearest neighbor pairs. To this end, we introduce the following graph regularization term to explore the geometric information of data:

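$$\min_{W,S} \lambda_{1}\sum_{i=1}^{C}\sum_{j,k=1,\,k\neq j}^{n_{i}} \left(S_{j,k}^{i}\right)^{2}\left\|W^{T}x_{j}^{i} - W^{T}x_{k}^{i}\right\|_{2}^{2} + \left\|Y - X^{T}W\right\|_{F}^{2} \quad \text{s.t.}\ \sum_{k=1,\,k\neq j}^{n_{i}} S_{j,k}^{i} = 1,\ S_{j,k}^{i}\geq 0 \qquad (6)$$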

where $X\in R^{m\times n}$ and $Y\in R^{n\times C}$ are the given training set and corresponding label matrix, $C$ is the class number, $n_{i}$ is the sample number of the $i$ th class. $x_{j}^{i}$ and $x_{k}^{i}$ denote the $j$ th and $k$ th sample of the $i$ th class, respectively. Accordingly, $S_{j,k}^{i}$ is the weight to regularize the distance of the corresponding samples $x_{j}^{i}$ and $x_{k}^{i}$ in the discriminant subspace. $\lambda_{1}$ is the penalty parameter.

We can see that by introducing the constraint $\sum\limits_{k=1,k\neq j}^{{n_{i}}}{S_{j,k}^{i}}=1,S_{j,k}^{i}\geq 0$, model (6) treats all classes equally in preserving their own nearest neighbor structures and avoids the trivial solution for weight graph $S$ [42]. In addition, all weights are adaptively learned from the latent discriminant subspace rather than the complex original space, which enables the method to capture the intrinsic nearest neighbor relationships of samples.
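With $W$ fixed, the weights of each sample can be obtained in closed form: minimizing $\sum_{k\neq j}(S_{j,k}^{i})^{2}d_{j,k}$ subject to $\sum_{k\neq j}S_{j,k}^{i}=1$, where $d_{j,k}=\|W^{T}x_{j}^{i}-W^{T}x_{k}^{i}\|_{2}^{2}$, gives $S_{j,k}^{i}\propto 1/d_{j,k}$ by a Lagrange-multiplier argument, which matches the weight update listed among the paper's equations. The sketch below illustrates this for one class; the function name, the small `eps` guard against zero distances, and the toy data are assumptions.

```python
import numpy as np

def update_class_weights(Z, eps=1e-12):
    """Closed-form weights for one class with at least two samples.

    Z is the projected data W^T X_i of that class, one sample per column.
    Row j of the result holds S_{j,k}, with zero diagonal and unit row sum.
    eps is an assumed guard so that coincident samples do not divide by zero.
    """
    sq = np.sum(Z ** 2, axis=0)
    d = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)  # squared distances
    inv = 1.0 / np.maximum(d, eps)
    np.fill_diagonal(inv, 0.0)                       # exclude k == j
    return inv / inv.sum(axis=1, keepdims=True)

# Toy projected samples (illustrative only): three 1-D points of one class.
Z_toy = np.array([[0.0, 1.0, 3.0]])
S = update_class_weights(Z_toy)
```

For the first sample, the neighbor at squared distance 1 receives weight 0.9 and the one at squared distance 9 receives weight 0.1, so closer neighbors are regularized more strongly, as the text requires.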

As analyzed in the previous section and proved in many references [23, 34, 17, 22], exploiting a more flexible target matrix with large margins between the incorrect and correct classes is very beneficial to improve the discriminability of the projection. Owing to the flexibility and simplicity of the retargeted learning approach [23], we exploit it to replace the strict zero-one target matrix and rewrite our model as follows:

[TABLE]

where $T\in R^{n\times C}$ is the target matrix, $l_{i}\in\{1,...,C\}$ is the true label index of the $i$ th sample.

In most cases, data acquired in real-world applications have high dimensionality and contain many redundant features and even noise. These redundant features and noise are useless or even harmful to model training. To address this issue, we further impose a row-sparsity norm constraint on the projection, and the final model is:

[TABLE]

where $\lambda_{2}$ is also a penalty parameter.

Proposition 1: Introducing the constraint ${\lambda_{2}}{\left\|W\right\|_{2,1}}$ allows the learned projection $W$ to simultaneously perform feature selection and feature extraction.

Proof: As presented in Section II, $\left\|W\right\|_{2,1}$ is defined as ${\left\|W\right\|_{2,1}}=\sum\limits_{i=1}^{m}{\sqrt{\sum\limits_{j=1}^{C}{W_{i,j}^{2}}}}$, which is equivalent to the $l_{1}$ norm ${\left\|a\right\|_{1}}=\sum\limits_{i=1}^{m}{a_{i}}$ of the nonnegative vector $a$ with ${a_{i}}={\left\|{W_{i,:}}\right\|_{2}}=\sqrt{\sum\limits_{j=1}^{C}{W_{i,j}^{2}}}$. According to the theory of sparse representation, minimizing an optimization problem regularized by the $l_{1}$ norm drives some elements of the corresponding vector to zero [17, 43]. In other words, some elements of vector $a$ will be driven to zero. Accordingly, if element $a_{i}$ is driven to zero, all elements of the $i$th row of matrix $W$ must be zero because ${\left\|{W_{i,:}}\right\|_{2}}=\sqrt{W_{i,1}^{2}+\ldots+W_{i,C}^{2}}=0$. This demonstrates that minimizing problem (8) will adaptively drive some rows of the projection $W$ to zero. According to the basic rule of matrix multiplication, if all elements of the $i$th row of projection $W$ are zero, then the $i$th feature of any sample makes no contribution to the new features generated by the linear combination. That is to say, the features corresponding to the all-zero rows of $W$ are not selected during the feature extraction. So we conclude that introducing the sparse constraint ${\lambda_{2}}{\left\|W\right\|_{2,1}}$ to the model gives the method the feature selection property. Thus we complete the proof.

Proposition 1 allows the method to select the important features for classification and to reduce the negative influence of noise and redundant features. Besides, imposing the $l_{2,1}$ norm on the projection can improve its interpretability [44].
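The row-sparsity argument of Proposition 1 can be illustrated with a small numerical sketch. The toy matrix `W` and the sample `x` below are made-up values for illustration only (they are not learned by ALPR); the sketch merely shows that features whose rows in `W` are zero cannot influence the projected sample $W^{T}x$.

```python
import numpy as np

def l21_norm(W):
    """l_{2,1} norm: sum of the l2 norms of the rows of W."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

# Hypothetical projection with m = 4 features and C = 2 classes.
# Rows 1 and 3 are exactly zero, i.e. those two features are "not selected".
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.0]])

x = np.array([1.0, 5.0, 2.0, -7.0])       # one sample with 4 features
x_perturbed = x.copy()
x_perturbed[[1, 3]] = [100.0, -100.0]      # change only the unselected features

norm_value = l21_norm(W)                   # row norms are 5, 0, 1, 0
z = W.T @ x                                # projected sample
z_perturbed = W.T @ x_perturbed            # identical, since rows 1 and 3 are zero
```

Because the second and fourth rows of `W` vanish, arbitrary changes to the second and fourth features leave the projection unchanged, which is the feature selection behaviour described in the proof.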

III-B Solution to the proposed method

It is obvious that problem (8) has no analytical solution since it contains three variables, i.e., $W,S,T$ , in one problem. In this section, we provide an efficient iterative algorithm to obtain their local optimal solutions as follows.

Step 1: Calculate variable $W$. For convenience, we define

[TABLE]

Problem (9) can be simplified as follows:

[TABLE]

where ${S_{W}}=\sum\limits_{i=1}^{C}{n_{i}}\sum\limits_{k,j=1,k\neq j}^{n_{i}}{\left(S_{j,k}^{i}\right)^{2}\left({x_{j}^{i}-x_{k}^{i}}\right){\left({x_{j}^{i}-x_{k}^{i}}\right)}^{T}}$. Then we can obtain the optimal $W$ by setting the derivative of $L(W)$ with respect to $W$ to zero as follows:

[TABLE]

where $D\in R^{m\times m}$ is a diagonal matrix whose $i$th diagonal element is ${D_{i,i}}=1/\sqrt{\sum\limits_{j=1}^{C}{W_{i,j}^{2}}}$ [17].
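A minimal sketch of Step 1 is given below. It is not the authors' code: it assumes the objective $\left\|T-X^{T}W\right\|_{F}^{2}+\lambda_{1}Tr\left(W^{T}S_{W}W\right)+\lambda_{2}\left\|W\right\|_{2,1}$ and the definitions of $S_{W}$ and $D$ given above, and it obtains the update by setting the gradient $2X(X^{T}W-T)+2\lambda_{1}S_{W}W+\lambda_{2}DW$ to zero with $D$ computed from the previous iterate. The resulting factor $\lambda_{2}/2$ follows from this derivation and may be absorbed differently in the paper's own closed form. The function names, the `S_blocks` dictionary layout, and the small constant `eps` (which avoids division by zero for all-zero rows) are illustrative assumptions.

```python
import numpy as np

def build_S_W(X, labels, S_blocks):
    """S_W = sum_i n_i * sum_{j != k} (S^i_{j,k})^2 (x_j - x_k)(x_j - x_k)^T.

    X: (m, n) data with one sample per column.
    labels: (n,) integer class index of each column.
    S_blocks: dict mapping class index -> (n_i, n_i) weight matrix (assumed layout).
    """
    m = X.shape[0]
    S_W = np.zeros((m, m))
    for c, S in S_blocks.items():
        Xc = X[:, labels == c]
        n_c = Xc.shape[1]
        for j in range(n_c):
            for k in range(n_c):
                if j != k:
                    d = Xc[:, j] - Xc[:, k]
                    S_W += n_c * S[j, k] ** 2 * np.outer(d, d)
    return S_W

def update_W(X, T, S_W, W_prev, lam1, lam2, eps=1e-8):
    """One reweighted update of W with S and T fixed.

    D is built from the previous iterate W_prev, D_ii = 1 / ||W_prev[i, :]||_2.
    """
    row_norms = np.sqrt(np.sum(W_prev ** 2, axis=1)) + eps
    D = np.diag(1.0 / row_norms)
    A = X @ X.T + lam1 * S_W + 0.5 * lam2 * D   # (m, m) system matrix
    return np.linalg.solve(A, X @ T)            # solve instead of forming the inverse
```

Using `np.linalg.solve` avoids forming the explicit inverse, although the cost of this step is still cubic in the feature dimension $m$.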

Step 2: Calculate variable $S$ . Fixing variables $T$ and $W$ , variable $S$ can be obtained by solving the following problem:

[TABLE]

Problem (12) can be further simplified into the following subproblems:

[TABLE]

Theorem 1 [42]. Given a positive vector $b\in R^{1\times n}$ ( $\forall i$ ( $1\leq i\leq n$ ), ${b_{i}}>0$ ), problem $\mathop{\min}\limits_{\sum\limits_{i=1}^{n}{{a_{i}}}=1,{a_{i}}\geq 0}\sum\limits_{i=1}^{n}{a_{i}^{2}{b_{i}}}$ has the optimal solution as ${a_{i}}=\frac{1}{{{b_{i}}}}{\left({\sum\limits_{k=1}^{n}{\frac{1}{{{b_{k}}}}}}\right)^{-1}}$ .

Proof. The detailed proof is given in Appendix A of the supplementary material.

According to Theorem 1, we can obtain the optimal solution to problem (13):

[TABLE]
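As a hedged illustration of Step 2, the sketch below applies Theorem 1 row by row. It assumes that, for a fixed sample $x_{j}^{i}$, the vector $b$ in Theorem 1 consists of the squared subspace distances $\left\|W^{T}x_{j}^{i}-W^{T}x_{k}^{i}\right\|_{2}^{2}$ to the other samples of the same class, which is what the definition of $S_{W}$ suggests (the constant factor $n_{i}$ does not change the minimizer). The function names and the small constant `eps`, added to keep $b$ strictly positive, are illustrative assumptions.

```python
import numpy as np

def simplex_weights(b):
    """Theorem 1: argmin of sum_i a_i^2 b_i subject to sum_i a_i = 1, a_i >= 0, for b_i > 0."""
    inv = 1.0 / np.asarray(b, dtype=float)
    return inv / inv.sum()

def update_S_for_class(Xc, W, eps=1e-12):
    """Weights for one class.

    Xc: (m, n_i) samples of the class, one per column; W: (m, C) projection.
    Returns an (n_i, n_i) matrix with zero diagonal whose rows sum to one.
    """
    Z = W.T @ Xc                                   # (C, n_i) projected samples
    diff = Z[:, :, None] - Z[:, None, :]           # (C, n_i, n_i) pairwise differences
    dist2 = np.sum(diff ** 2, axis=0) + eps        # squared distances in the subspace
    n = Xc.shape[1]
    S = np.zeros((n, n))
    for j in range(n):
        others = np.arange(n) != j                 # exclude k = j
        S[j, others] = simplex_weights(dist2[j, others])
    return S
```

For example, `simplex_weights([1.0, 2.0, 4.0])` returns weights proportional to $1, 1/2, 1/4$, so the closest neighbor receives the largest weight, as intended by the adaptive graph.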

Step 3: Calculate variable $T$ . Fixing variables $W$ and $S$ , the subproblem to variable $T$ is as follows:

[TABLE]

Problem (15) is a typical constrained quadratic programming problem [45].

Theorem 2 [23]. For any given vector $g=\left[{{g_{1}},{g_{2}},\ldots,{g_{n}}}\right]$, the optimal solution of problem $\mathop{\min}\limits_{{t_{h}}-\mathop{\max}\limits_{i\neq h}{t_{i}}\geq 1}\left\|{t-g}\right\|_{2}^{2}$ is

[TABLE]

where ${v_{i}}=1+{g_{i}}-{g_{h}}$ and $\Delta=\frac{\sum\limits_{i\neq h}{v_{i}}\Upsilon\left({\Gamma^{\prime}\left({v_{i}}\right),0}\right)}{1+\sum\limits_{i\neq h}\Upsilon\left({\Gamma^{\prime}\left({v_{i}}\right),0}\right)}$. $\Upsilon\left(\cdot\right)$ is an indicator function defined as $\Upsilon\left({a,b}\right)=\left\{\begin{array}{ll}1, & \text{if } a>b\\ 0, & \text{otherwise}\end{array}\right.$, and $\Gamma^{\prime}\left(x\right)=2\left(x+\sum\limits_{i\neq h}\min\left({x-{v_{i}},0}\right)\right)$.

Proof: Please refer to Appendix B in the supplementary material for the complete proof.

It is obvious that problem (15) can be decomposed into the following subproblems

[TABLE]

According to Theorem 2, we can quickly obtain the optimal solution $T_{i,:}$ of problem (17). By computing all rows of $T$ separately according to (17) and Theorem 2, the optimal solution $T$ is finally obtained.
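The row-wise retargeting of Step 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the quantities $v_{i}$, $\Delta$, $\Upsilon$ and $\Gamma^{\prime}$ defined with Theorem 2, assumes a margin of 1 (consistent with $v_{i}=1+g_{i}-g_{h}$), and assumes the solution form of the retargeting approach [23], namely $t_{h}=g_{h}+\Delta$ and $t_{i}=g_{i}+\min\left(\Delta-v_{i},0\right)$ for $i\neq h$. Class indices are 0-based in the code, whereas the text uses $\{1,\ldots,C\}$; the function names are hypothetical.

```python
import numpy as np

def retarget_row(g, h):
    """Closest t to g (in l2) such that t[h] - max_{i != h} t[i] >= 1."""
    g = np.asarray(g, dtype=float)
    others = np.arange(g.size) != h
    v = 1.0 + g[others] - g[h]
    # Gamma'(x) = 2 * (x + sum_i min(x - v_i, 0)), evaluated at every x = v_j
    gamma = 2.0 * (v + np.minimum(v[:, None] - v[None, :], 0.0).sum(axis=1))
    active = gamma > 0                               # Upsilon(Gamma'(v_i), 0)
    delta = v[active].sum() / (1.0 + active.sum())
    t = g.copy()
    t[h] = g[h] + delta
    t[others] = g[others] + np.minimum(delta - v, 0.0)
    return t

def retarget(G, labels):
    """Apply retarget_row to every row of G = X^T W; labels holds 0-based class indices."""
    return np.vstack([retarget_row(G[i], labels[i]) for i in range(G.shape[0])])
```

For instance, with `g = [0.2, 0.9, 0.1]` and true class `h = 0`, both incorrect classes violate the margin, and the returned target raises the true-class entry and lowers the other two until the margin equals exactly 1. If `g` already satisfies the margin, it is returned unchanged.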

Algorithm 1 summarizes the complete optimization steps of our method.

III-C Classification based on ALPR

After obtaining the projection $W$, LR based methods transform all the original data into the discriminant subspace via $W^{T}X$ for the subsequent classification. Each projected sample can be viewed as a new representation, obtained as a weighted linear combination of the original features of the corresponding sample with weights given by $W$. However, this simple weighted combination may magnify the negative influence of noise or redundant features residing in the high-dimensional data. Thanks to the feature selection property of the $l_{2,1}$ norm constraint, our method has the potential to discover the discriminability of different features. As analyzed in Proposition 1, by imposing the $l_{2,1}$ norm on the projection, some rows of $W$ corresponding to unimportant features will be adaptively assigned very small values or even 0, so these rows also have very small $l_{2}$ norms. This indicates that the discriminability of each feature can be measured by the $l_{2}$ norm of the corresponding row of $W$ [17]. Thus we can directly set those rows to zero to further eliminate the negative influence of noise and redundant features. Motivated by this, we provide the approach listed in Algorithm 2 to implement the classification, in which a threshold value $\rho$ is set to select the discriminative features.
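The thresholding idea can be sketched as below. Algorithm 2 itself is not reproduced in this section, so the details are assumptions: rows of $W$ whose $l_{2}$ norm does not exceed $\rho$ are set to zero (an absolute, strict threshold is assumed), the data are projected with the pruned matrix, and a simple 1-nearest-neighbor rule stands in for whatever classifier Algorithm 2 specifies. All function names are hypothetical.

```python
import numpy as np

def select_and_project(W, X, rho=1e-4):
    """Zero the rows of W with l2 norm <= rho, then project X (m x n, one sample per column)."""
    row_norms = np.linalg.norm(W, axis=1)
    keep = row_norms > rho                       # mask of selected features
    W_sel = np.where(keep[:, None], W, 0.0)      # pruned projection
    return W_sel.T @ X, keep

def nearest_neighbor_predict(Z_train, y_train, Z_test):
    """1-NN in the projected space (assumed classifier); inputs are (C, n) arrays."""
    d2 = ((Z_test[:, :, None] - Z_train[:, None, :]) ** 2).sum(axis=0)  # (n_test, n_train)
    return y_train[np.argmin(d2, axis=1)]
```

The returned mask `keep` also indicates which original features are retained, which is the information visualized when unselected features are marked on the input images.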

IV Analysis of the proposed method

IV-A Computational complexity

In this section, we mainly analyze the computational complexity of the proposed optimization algorithm listed in Algorithm 1. For simplicity, we do not consider the computational complexities of some simple matrix operations, such as matrix addition, subtraction, multiplication, and element-wise based matrix division since these operations can be efficiently computed. Overall, there are three main variables, i.e., $W$ , $S$ , and $T$ , to calculate from Algorithm 1. In Step 1, i.e., updating variable $W$ , the major computational cost is the matrix inverse operation which has the computational complexity of $O\left({{m^{3}}}\right)$ for an $m\times m$ matrix [46]. Thus the total computational complexity of Step 1 is about $O\left({{m^{3}}}\right)$ . For Steps 2 and 3, it is obvious that these two steps only contain some simple matrix operations and thus their computational complexities can be ignored. In summary, the total computational complexity of the optimization approach listed in Algorithm 1 is about $O\left({\tau{m^{3}}}\right)$ , where $\tau$ is the iteration number.

IV-B Convergence analysis

From the previous presentation, all variables can be simply calculated with the closed form solutions in their own subproblems. Meanwhile, we can prove the following proposition.

Proposition 2: For the optimization problem (8), each subproblem is convex with respect to variables $W,S,T$ , respectively.

Proof: From (9), the subproblem with respect to variable $W$ is a typical sparsity-regularized optimization problem. Because the sparse ${l_{2,1}}$ norm is a convex function [17], problem (9) is a convex optimization subproblem. For variable $S$, all constraints with respect to $S$ are convex, and thus subproblem (12) is also convex [32, 33]. The subproblem (16) with respect to variable $T$ is also a typical convex problem, i.e., a convex constrained quadratic programming problem [34]. Therefore, the subproblems with respect to variables $W$, $S$, and $T$ are each convex. Thus we complete the proof.

Based on the proposition 2, we can derive the following Theorem 3 to the presented Algorithm 1 [19, 32].

Theorem 3: The optimization approach presented in Algorithm 1 monotonically decreases the objective value of problem (8).

Proof: Let $L\left({{W^{t}},{S^{t}},{T^{t}}}\right)$ be the objective function value of problem (8) at the $t$th iteration. At the $\left({t+1}\right)$th iteration, the optimal solution ${W^{t+1}}$ is first calculated by solving the subproblem (9). Because subproblem (9) is convex, we have the following inequality after this iteration step

[TABLE]

Similarly, owing to the convexity of the subproblems with respect to variables $S$ and $T$, we can obtain the following inequalities after the corresponding iterations:

[TABLE]

Combining Eqs. (22), (23) and (24), we can obtain

[TABLE]

Therefore, we complete the proof.

Theorem 3 provides some assurance of the convergence of the proposed method. Since the objective function (8) is lower bounded, the objective value converges and the proposed method finally reaches a locally optimal solution [47]. In the subsequent section, we further verify the convergence property empirically.

IV-C Connections to other methods

In this section, we analyze the connections and differences between the proposed method and some related LR methods, such as ReLSR [23], MSRL [22], DLSR [17] and SLRR [24], etc.

(1) Connections to ReLSR and MSRL: The discriminative regression model of MSRL is as follows:

[TABLE]

From the models of MSRL in (22) and ReLSR in (2), we can find that the proposed method and MSRL are both extensions of ReLSR. By introducing some valuable constraints to the model, MSRL and the proposed method can learn a more compact and discriminative projection than ReLSR. Although MSRL and the proposed method share some similarities, they differ in the following respects. (i) MSRL exploits an unsupervised graph regularization term to guide the projection learning, whereas the graph regularization term in our method is supervised. The unsupervised graph regularization term of MSRL has several shortcomings. For example, in MSRL, samples from different classes may be regarded as nearest neighbors and connected with a large weight, which is obviously unreasonable. Besides, MSRL is sensitive to the number of nearest neighbors. Our method does not have these problems since it constructs a more reliable graph through the novel supervised graph learning term. (ii) MSRL imposes the ‘Frobenius norm’ on the projection matrix, whereas our method introduces the sparse ${l_{2,1}}$ norm. Compared with MSRL, the proposed method therefore has the feature selection property, which enables it to select the most important features for regression and has the potential to reduce the negative influence of redundant features and noise. (iii) The proposed method has fewer penalty parameters than MSRL, which greatly reduces the complexity of parameter selection. In Fig.2, we have plotted the first 100 rows of the projections learned by ReLSR, MSRL, and the proposed method on the PIE database. From Fig.2, it is obvious that some rows of the projection learned by the proposed method (Fig.2(c)) are adaptively driven to zero, while the other two projections do not show this phenomenon. As analyzed in Proposition 1, the features of the original data corresponding to these rows are not selected for feature extraction. In Fig.3(a), we have marked these unselected features with black points on the original images. From Fig.3(a), we can find that some areas that are similar across different faces are marked, while features of the remaining areas, such as the eyes, mouth, and nose, are treated as important features. This also shows that the proposed method has the potential to select the important features from the original high-dimensional data. Meanwhile, in Fig.3(b), we have plotted the weight graph of features, in which the weight corresponding to the $i$th feature is calculated as $||W_{i,:}||_{2}$ [17]. From Fig.3(b), it is easy to see the importance of different features in classification. This demonstrates that the projection learned by our method has good interpretability for features. In summary, the proposed method has several advantageous properties in comparison with MSRL and ReLSR, which allow it to perform better than these methods.

(2) Connections to SLRR and DLSR. Similar to the proposed method, SLRR and DLSR both impose the sparse ${l_{2,1}}$ norm constraint on the projection matrix. The objective functions of SLRR and DLSR are respectively formulated as follows

[TABLE]

where $B$ is a constant matrix and defined as follows

[TABLE]

where $l_{i}$ is the true class index of sample $x_{i}$ .

From (23), (24), and (8), the proposed method and DLSR both exploit the target learning technique to enlarge the margins between samples from different classes, and thus can learn a more discriminative projection than SLRR. Compared with DLSR, our method not only exploits a more flexible target learning technique, but also introduces an adaptive graph regularization term to further improve the compactness of the projection. These properties encourage the proposed method to learn a more discriminative projection than DLSR and to avoid overfitting, and thus to obtain a better performance.

V Experiments and analysis

In this section, we evaluate the proposed method on the synthetic database and five real-world databases. Several related methods, including linear regression classification (LRC) [48], sparse representation based classification (SRC) [49], collaborative representation based classification (CRC) [50], support vector machine (SVM) [51] (implemented with the LibSVM toolbox, available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/), LRLR [24], low-rank ridge regression (LRRR) [24], SLRR [24], discriminative least squares regression (DLSR) [17], ReLSR [23], DRLS [25], MSRL [22], constrained least square regression (CLSR) [52], and groupwise retargeted least-squares regression (GReLSR) [53], are chosen for comparison with the proposed method. Among these methods, LRRR is an extension of LRLR, which imposes the ‘Frobenius’ norm on the low-rank projection. CLSR and GReLSR can be viewed as extensions of ReLSR, which mainly introduce the group based label relaxation technique to improve the performance. For the proposed method, the threshold value $\rho$ is set to 0.0001 for all databases.

V-A Experiments and analysis on the synthetic database

Preserving the geometric structure is very important to discriminant analysis, especially for some databases with manifold structure. Following the experimental settings in [42], we synthesize the typical manifold data, i.e., three-ring data with two different amplitudes of the third feature, to prove the effectiveness of the proposed method in dealing with such type of data. For convenience, we refer to the three-ring data with amplitude $\left[{-20,20}\right]$ as Th1, and refer to the other one with amplitude $\left[{-2000,2000}\right]$ as Th2. Fig.4 (a) shows the typical data of the Th1, which contains three features. For Th1 and Th2, the first two features are centrally distributed in the circle style, while the third feature can be viewed as the noise to some extent. In our experiments, each synthesized three-ring data is composed of 3 classes and 1000 samples per class, in which 500 samples are randomly selected from each class to form the training set, and the remaining samples are treated as the test set accordingly.

Table I lists the experimental results of different methods, including the nearest neighbor classifier (NC), on Th1 and Th2. In Fig.4, we plot the projected test samples and their predicted labels for different methods on Th1, and show the projection learned by our method on Th1. From the experimental results in Table I, it is obvious that most methods, except SVM and the proposed method, perform worse than the simplest classifier, i.e., NC, on Th1 and Th2. In addition, from Fig.4, we can also find that only the proposed method simultaneously preserves the intrinsic structure well and obtains a satisfactory classification result. Although LRLR, LRRR, and SLRR preserve a structure similar to that of the original data, they cannot predict the test samples correctly. Therefore, these experimental results demonstrate the effectiveness of the proposed method in classifying data with manifold structure. Moreover, from Table I, we can also find that as the amplitude of the third feature (noise) increases, the classification accuracies of almost all methods decrease dramatically, whereas the classification accuracy of our method decreases by only about 0.06%. This demonstrates that the proposed method is superior to the other methods for classification tasks with noise. Furthermore, from Fig.4 (l), we can find that the feature extraction weights corresponding to the third feature (noise) of Th1 are adaptively set to 0. This shows that introducing the row-sparsity norm constraint is valuable and can effectively reduce the negative influence of noise. In summary, the proposed method is not only suitable for classifying data with manifold structure, but also robust to noise to some extent.

V-B Experiments and analysis on real-world databases

In this subsection, five benchmark databases listed in Table II are chosen to evaluate the effectiveness of our method.

(1) COIL100 object database [55] (available at http://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php): COIL100 database is one of the most popular benchmarks for object classification. It is composed of 7200 images of 100 objects, in which each object has 72 images with different poses. Fig.5 (a) shows some example images of the database. Each image used in the experiments was normalized and resized to $32\times 32$ with black background in advance. For this database, we randomly select 10, 15, 20, and 25 samples of each class to form the training set and treat the remaining samples as the test set, respectively.

(2) CMU pose, illumination, and expression (PIE) face database [56]: PIE database is one of the challenging databases for face recognition since it contains over 40,000 images with various poses, illumination conditions, and expressions. In our experiments, we compare different methods on a subset of PIE which contains 11554 images of 68 individuals in total. Each class has nearly 170 samples with 5 different poses. Fig.5 (b) shows some typical images of the database. For computational efficiency, each image was resized in advance to $32\times 32$ and then stacked into a vector with 1024 dimensions. For this database, we randomly select 10, 15, 20, and 25 samples per class as the training set and treat the remaining samples as the test set for experiments, respectively.

(3) Labeled Faces in the Wild (LFW) face database [57]: The LFW database is more challenging than the PIE face database since all images are directly collected from the web with different poses, backgrounds, expressions, illuminations, and image acquisition devices. In our experiments, a subset which contains 1251 cropped face images of 86 persons is chosen for comparison [58]. There are about 11-20 samples in each class. Some typical images from the same class are shown in Fig.5 (c). Similarly, each image was transformed into a $32\times 32$ matrix and then stacked into a vector. Then 5, 6, 7, and 8 samples are randomly chosen from each class to form the training set and the remaining samples are regarded as the test set accordingly.

(4) Fifteen Scene Categories (Scene15) database [59] (available at http://www-cvr.ai.uiuc.edu/ponce_grp/data/): The Scene15 database is widely chosen to evaluate different methods for the scene classification task. The 4485 images are naturally collected from 15 common scenes in daily life, such as street, office, store, highway, living room and kitchen. For each scene, there are about 210-410 natural samples. Fig.5 (d) shows some typical images of the database. It is not suitable to directly evaluate different methods on the original images because they differ considerably in size, intensity, shape, and background. In our experiments, we compare different methods on its feature-level database by following the experimental settings in [33], in which all samples are represented by their spatial pyramid features with 3000 dimensions. We refer to the feature-level database as Scene_SPM for convenience. Similarly, 10, 20, 30, and 40 samples of each class are randomly selected to form the training set and the remaining samples are regarded as the test set, respectively.

(5) CIFAR-10 database [60]: The CIFAR-10 database is a popular large-scale image database, which consists of 50000 training images and 10000 test images from 10 classes. The size of the original color images in the database is $32\times 32$. Some typical images are shown in Fig.5 (e). In our work, we first exploit the k-means based feature extraction method [61] (code available at http://ai.stanford.edu/~acoates/papers/kmeans_demo.tgz) to extract the features of the CIFAR-10 database, and then utilize the principal component analysis (PCA) [62] algorithm to reduce the feature of each sample to 1000 dimensions to improve the computational efficiency. We refer to the extracted features of CIFAR-10 as K-means-CIFAR10. On this dataset, several well-known deep convolutional network based classification methods, including ResNet with 110 layers [63], simple fast convolutional (SFC) [64], deep linear discriminant analysis (DeepLDA) [65], and DenseNet [66], are also compared.

For the first four databases, we run each method 20 times and report the mean classification accuracies for fair comparison. For the CIFAR-10 database, we run all methods on the same 50000 training samples and 10000 test samples. The experimental results of different methods on the above five databases are listed in Table III-Table VII. From the experimental results, we can find that our method obtains better performance than the other methods in most cases. In addition, the following interesting points can be obtained according to the experimental results:

(1) We can find that DLSR, ReLSR, CLRS, and GReLSR generally outperform LRLR, LRRR, and SLRR on the above five databases, which proves the effectiveness of the $\varepsilon$ -dragging technique and discriminative target learning technique. In other words, learning a more flexible target matrix with large margins of different classes is beneficial to learn a more discriminative projection for classification.

(2) From these four tables, it is obvious that MSRL and the proposed method always perform much better than ReLSR. As analyzed in the previous section, MSRL and the proposed method are the two extensions of ReLSR, which exploit the local geometric information of data to guide the projection learning. Therefore, the experimental results prove that preserving the local geometric structure of data during the linear regression is also significant and enables the two methods to learn a more discriminative projection.

(3) The proposed method and MSRL obtain comparably good performance on the PIE and Scene_SPM databases, while on the LFW database the proposed method significantly outperforms MSRL. From Fig.5 (c), we can find that images of the same class also have very large differences in the LFW database. Thus it is difficult to capture the intrinsic geometric structure of data, especially using an unsupervised approach. In other words, MSRL cannot find the intrinsic nearest neighbor relationships to guide the projection learning, and thus cannot guarantee satisfactory performance. Compared with MSRL, our method overcomes this issue by exploiting a supervised approach to adaptively capture the intrinsic similarity relationships among samples of the same class, which plays a positive guiding role in the projection learning. Meanwhile, as analyzed in the previous section, the proposed method has the potential to adaptively select the important features from data for feature extraction and to effectively reduce the negative influence of noise, which is also beneficial to the classification performance. These two properties enable the proposed method to obtain a better performance than MSRL on the LFW database.

(4) From Table VII, we can find that all the deep convolutional network based methods achieve much better performance than the conventional methods. This indicates that the deep convolutional network based methods can extract more discriminative features than the unsupervised feature extraction method used here, i.e., k-means. Among the conventional methods, the proposed method still outperforms all the other methods, which also shows that it can learn a more discriminative projection than the other conventional methods.

V-C Parameter sensitivity and selection

Generally, for some methods, selecting the optimal penalty parameters is crucial to achieve satisfactory performance on different databases. In this section, we mainly analyze the sensitivity of parameter selection of the proposed method and then provide a simple strategy to select the optimal parameters. From (8), we can find that the proposed method only contains two penalty parameters, i.e., ${\lambda_{1}}$ and ${\lambda_{2}}$ , which are regularized on the nearest neighbor preserving term and the feature selection term, respectively. To analyze the sensitivity of the classification performance to these two parameters, firstly, a large candidate range $\{{{{10}^{-5}},{{10}^{-4}},{{10}^{-3}},{{10}^{-2}},{{10}^{-1}},1,{{10}^{1}},{{10}^{2}},{{10}^{3}},{{10}^{4}},{{10}^{5}}}\}$ is defined for the two penalty parameters. Secondly, we conduct several experiments to show the relationships of the classification accuracies (%) and different values of the two parameters on the first four databases. Fig.6 shows the classification accuracies versus the two parameters. From these figures, it is obvious that when parameter ${\lambda_{1}}$ is selected from the range of $\left[{{{10}^{-5}},1}\right]$ , and parameter ${\lambda_{2}}$ locates in the proper range, such as $\left[{0.1,1}\right]$ on the COIL100 database, $\left[{0.1,1}\right]$ on the PIE database, $\left[{0.1,1}\right]$ on the LFW database, and $\left[{{{10}^{-5}},1}\right]$ on the Scene_SPM database, respectively, the proposed method can obtain almost constant classification accuracy. This demonstrates that the proposed method is insensitive to the selection of ${\lambda_{1}}$ to some extent.

As far as we know, it is still an open problem to adaptively select the optimal parameters for different databases. In this paper, we exploit a simple approach based on grid search to find the two parameters [18, 67]. According to the previous analysis, we first define the candidate range $\left[{{{10}^{-5}},1}\right]$ for the two parameters. Then we fix parameter ${\lambda_{1}}$ as 0.1 since the proposed method is insensitive to it, and run the proposed method with different values of ${\lambda_{2}}$ selected from the coarse candidate range. In this way, we can find the latent optimal ${\lambda_{2}}$ from the candidate range. Then we fix ${\lambda_{2}}$ at the obtained value and run the proposed method again to find the optimal value of ${\lambda_{1}}$ from the same candidate range. Finally, we obtain the best combination of the two parameters for experiments.
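The two-stage search described above can be sketched as follows. The callback `evaluate(lam1, lam2)`, which is assumed to train the model and return a validation accuracy, is hypothetical, as is the choice of powers of ten as grid points within $\left[{{{10}^{-5}},1}\right]$ (taken here from the sensitivity analysis in this section).

```python
def coordinate_grid_search(evaluate, grid, lam1_init=0.1):
    """Two-stage search: tune lam2 with lam1 fixed, then tune lam1 with lam2 fixed.

    evaluate: hypothetical callable (lam1, lam2) -> accuracy, higher is better.
    grid: iterable of candidate values shared by both parameters.
    """
    grid = list(grid)
    best_lam2 = max(grid, key=lambda lam2: evaluate(lam1_init, lam2))
    best_lam1 = max(grid, key=lambda lam1: evaluate(lam1, best_lam2))
    return best_lam1, best_lam2

# Candidate values 1e-5, 1e-4, ..., 1 (assumed spacing)
GRID = [10.0 ** p for p in range(-5, 1)]
```

This requires only $2\left|grid\right|$ model fits instead of $\left|grid\right|^{2}$ for a full two-dimensional grid, at the cost of possibly missing the jointly best pair.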

V-D Experiments of convergence study

In this section, we conduct some experiments to further verify the convergence property of the proposed optimization approach in Algorithm 1. In Fig.7, we have plotted the objective function values and classification accuracies versus the iteration steps on the COIL100, PIE, LFW, and Scene_SPM databases, respectively, in which the objective function value is directly calculated as $obj=\left({\left\|{T-{X^{T}}W}\right\|_{F}^{2}+{\lambda_{1}}Tr\left({{W^{T}}{S_{W}}W}\right)+{\lambda_{2}}{\left\|W\right\|_{2,1}}}\right)/\left\|X\right\|_{F}^{2}$ according to the objective function (8). From these figures, it is obvious that the objective function value decreases monotonically with the iterations until it reaches a stationary point, which agrees with Theorem 3. Meanwhile, we can also find that the classification accuracy increases until the objective function value converges, which suggests that the proposed method finally finds a locally optimal solution when the objective function converges.
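For reference, the normalized objective used in this convergence study can be computed as in the following sketch. The function name is hypothetical, and $S_{W}$ is assumed to have been assembled beforehand from the current weights $S$ according to its definition in Section III-B.

```python
import numpy as np

def normalized_objective(X, T, W, S_W, lam1, lam2):
    """(||T - X^T W||_F^2 + lam1 * Tr(W^T S_W W) + lam2 * ||W||_{2,1}) / ||X||_F^2.

    X: (m, n) data, T: (n, C) targets, W: (m, C) projection, S_W: (m, m).
    """
    fit = np.linalg.norm(T - X.T @ W, "fro") ** 2
    graph = lam1 * np.trace(W.T @ S_W @ W)
    sparse = lam2 * np.sum(np.linalg.norm(W, axis=1))
    return float((fit + graph + sparse) / np.linalg.norm(X, "fro") ** 2)
```

Recording this value after every outer iteration of the alternating updates gives the kind of curve plotted against the iteration number.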

VI Conclusion

We proposed an effective linear regression method for classification in this paper. The proposed method improves the discriminability of the projection through three approaches. Firstly, we adaptively learn a more flexible target matrix with large margins between the correct and incorrect classes for regression. Secondly, we replace the conventional ‘Frobenius norm’ with the sparse $l_{2,1}$ norm to constrain the projection, which enables the proposed method to select the most important features from the original high-dimensional data for feature extraction. Thirdly, we introduce a novel supervised graph regularization term to guide the projection learning. Compared with the conventional unsupervised graph learning approach, the supervised approach presented in our paper is more effective in preserving the intrinsic nearest neighbor relationships of each class. Most importantly, the discriminative target learning, intrinsic graph learning, and projection learning are neatly integrated into one joint learning framework, which enables the method to obtain the global optimal projection for classification so as to obtain a better performance. The effectiveness of the proposed method has been demonstrated on the synthetic database and several real-world databases.

Bibliography


[1] Z. Lai, D. Mo, J. Wen, L. Shen, and W. Wong, “Generalized robust regression for jointly sparse subspace learning,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[2] G.-J. Qi, W. Liu, C. Aggarwal, and T. Huang, “Joint intermodal and intramodal label transfers for extremely rare or unseen classes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1360–1373, 2017.
[3] S. Jiang, M. Lian, C. Lu, Q. Gu, S. Ruan, and X. Xie, “Ensemble prediction algorithm of anomaly monitoring based on big data analysis platform of open-pit mine slope,” Complexity, vol. 2018, pp. 1–13, 2018.
[4] G.-J. Qi, Q. Tian, and T. Huang, “Locality-sensitive support vector machine by exploring local correlation and global regularization,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2011, pp. 841–848.
[5] L. Kang, H. L. Du, H. Zhang, and W. L. Ma, “Systematic research on the application of steel slag resources under the background of big data,” Complexity, vol. 2018, pp. 1–12, 2018.
[6] Q. Liu, X. Lu, Z. He, C. Zhang, and W.-S. Chen, “Deep convolutional neural networks for thermal infrared object tracking,” Knowledge-Based Systems, vol. 134, pp. 189–198, 2017.
[7] J. Li, B. Zhang, and D. Zhang, “Shared autoencoder gaussian process latent variable model for visual classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4272–4286, 2017.
[8] W. Deng, R. Yao, H. Zhao, X. Yang, and G. Li, “A novel intelligent diagnosis method using optimal ls-svm with improved pso algorithm,” Soft Computing, pp. 1–18, 2017.