Transition Subspace Learning based Least Squares Regression for Image Classification

Zhe Chen, Xiao-Jun Wu, and Josef Kittler

arXiv:1905.05445·cs.CV·June 17, 2019

TL;DR

This paper introduces TSL-LSR, a novel method for image classification that learns a transition subspace with low-rank constraints to better preserve data structure and improve discriminative power.

Contribution

The paper proposes a transition subspace learning approach with low-rank constraints for multicategory image classification, addressing overfitting and data structure preservation.

Findings

01. Outperforms state-of-the-art algorithms on multiple datasets
02. Effectively captures intrinsic data structures
03. Reduces overfitting in projection learning

Abstract

Only learning one projection matrix from original samples to the corresponding binary labels is too strict and will consequently lose some intrinsic geometric structures of the data. In this paper, we propose a novel transition subspace learning based least squares regression (TSL-LSR) model for multicategory image classification. The main idea of TSL-LSR is to learn a transition subspace between the original samples and binary labels to alleviate the problem of overfitting caused by strict projection learning. Moreover, in order to reflect the underlying low-rank structure of the transition matrix and learn more discriminative projection matrices, a low-rank constraint is added to the transition subspace. Experimental results on several image datasets demonstrate the effectiveness of the proposed TSL-LSR model in comparison with state-of-the-art algorithms.

Tables

$$Y = Y + \mu(\Omega - P). \tag{12}$$

Table I: Brief description of the four datasets used.

Dataset	Classes	Features	Total samples	Training samples
AR	100	540	2600	1000
CMU PIE	68	1024	11554	680
Feret	200	1600	1400	800
COIL-20	20	1024	1440	200

Table II: Classification accuracies (%, mean $\pm$ std) of different algorithms on different datasets.

Algorithms	AR	CMU PIE	Feret	COIL-20
LRC[7]	74.12 $\pm$ 1.50	75.67 $\pm$ 1.01	46.58 $\pm$ 1.33	92.30 $\pm$ 1.15
CRC[8]	93.36 $\pm$ 0.53	86.39 $\pm$ 0.60	57.07 $\pm$ 1.79	89.09 $\pm$ 1.48
ProCRC[9]	95.28 $\pm$ 0.41	89.00 $\pm$ 0.37	64.40 $\pm$ 2.54	90.61 $\pm$ 0.95
DLSR[10]	93.79 $\pm$ 0.50	87.54 $\pm$ 0.79	71.15 $\pm$ 1.27	93.27 $\pm$ 1.43
ReLSR[11]	94.53 $\pm$ 0.56	88.18 $\pm$ 0.79	72.98 $\pm$ 2.19	93.65 $\pm$ 1.94
GReLSR[12]	95.18 $\pm$ 0.74	86.88 $\pm$ 0.72	70.38 $\pm$ 2.14	90.98 $\pm$ 1.62
RLSL[13]	94.21 $\pm$ 0.35	87.70 $\pm$ 0.63	68.33 $\pm$ 1.57	93.75 $\pm$ 1.87
TSL-LSR (ours)	96.34 $\pm$ 0.43	89.92 $\pm$ 0.35	85.73 $\pm$ 1.39	94.34 $\pm$ 1.02

Equations

$$\min_{W}\ \|WX-H\|_{F}^{2}+\lambda\|W\|_{F}^{2} \tag{1}$$

$$\min_{W,Q,\Omega}\ \frac{1}{2}\|WX-\Omega\|_{F}^{2}+\alpha\|\Omega\|_{*}+\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\lambda_{1}}{2}\|W\|_{F}^{2}+\frac{\lambda_{2}}{2}\|Q\|_{F}^{2} \tag{2}$$

$$L(W,Q,\Omega,P,Y)=\frac{1}{2}\|WX-\Omega\|_{F}^{2}+\alpha\|P\|_{*}+\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\lambda_{1}}{2}\|W\|_{F}^{2}+\frac{\lambda_{2}}{2}\|Q\|_{F}^{2}+\frac{\mu}{2}\left\|\Omega-P+\frac{Y}{\mu}\right\|_{F}^{2} \tag{3}$$

$$L(W)=\frac{1}{2}\|WX-\Omega\|_{F}^{2}+\frac{\lambda_{1}}{2}\|W\|_{F}^{2} \tag{4}$$

$$W=\Omega X^{T}(XX^{T}+\lambda_{1}I)^{-1} \tag{5}$$

$$L(Q)=\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\lambda_{2}}{2}\|Q\|_{F}^{2} \tag{6}$$

$$Q=\beta H\Omega^{T}(\beta\Omega\Omega^{T}+\lambda_{2}I)^{-1} \tag{7}$$

$$L(\Omega)=\frac{1}{2}\|WX-\Omega\|_{F}^{2}+\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\mu}{2}\left\|\Omega-P+\frac{Y}{\mu}\right\|_{F}^{2} \tag{8}$$

$$\Omega=[(\mu+1)I+\beta Q^{T}Q]^{-1}(WX+\beta Q^{T}H+\mu P-Y) \tag{9}$$

$$L(P)=\alpha\|P\|_{*}+\frac{\mu}{2}\left\|\Omega-P+\frac{Y}{\mu}\right\|_{F}^{2} \tag{10}$$

$$P=I_{\frac{\alpha}{\mu}}\left(\Omega+\frac{Y}{\mu}\right) \tag{11}$$

$$\mu=\min(\mu_{max},\rho\mu).$$

$$\text{if } \|\Omega-P\|_{\infty}\leq tol.$$

Taxonomy

Topics: Face and Expression Recognition · Remote-Sensing Image Classification · Sparse and Compressive Sensing Techniques

Full text

Transition Subspace Learning based Least Squares Regression for Image Classification

Zhe Chen, Xiao-Jun Wu*, and Josef Kittler

*Corresponding author. Zhe Chen and Xiao-Jun Wu are with the School of Internet of Things, Jiangnan University, Wuxi 214122, China. Josef Kittler is with the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, U.K.

Abstract

Only learning one projection matrix from original samples to the corresponding binary labels is too strict and will consequently lose some intrinsic geometric structures of the data. In this paper, we propose a novel transition subspace learning based least squares regression (TSL-LSR) model for multicategory image classification. The main idea of TSL-LSR is to learn a transition subspace between the original samples and binary labels to alleviate the problem of overfitting caused by strict projection learning. Moreover, in order to reflect the underlying low-rank structure of the transition matrix and learn more discriminative projection matrices, a low-rank constraint is added to the transition subspace. Experimental results on several image datasets demonstrate the effectiveness of the proposed TSL-LSR model in comparison with state-of-the-art algorithms. This paper is under consideration at Pattern Recognition Letters.

Index Terms:

Least squares regression, transition subspace learning, low-rank structure constraint, multicategory image classification.

I Introduction

Least squares regression (LSR) is a very popular tool in the field of pattern recognition because of its computational efficiency and mathematical tractability. Many modified models, including LASSO regression [1], partial LSR [2], least-squares support vector machine [3], kernel ridge regression [4] and weighted LSR [5], have been proposed for classification tasks. In addition, some representation based classification algorithms, such as sparse representation based classification (SRC) [6], linear regression based classification (LRC) [7], collaborative representation based classification (CRC) [8] and probabilistic CRC (ProCRC) [9], are also computed under the LSR model. These algorithms have achieved varying degrees of success in improving classification accuracy.

Consider $n$ training samples $\{x_{1},x_{2},...,x_{n}\}$ from $c$ classes, where $x_{i}\in R^{d}$ denotes a sample vector. $d$ is the dimensionality of the sample. If collecting these samples as a training matrix $X=[x_{1},x_{2},...,x_{n}]\in R^{d\times n}$ , the standard LSR model can be defined as follows

$$\min_{W}\ \|WX-H\|_{F}^{2}+\lambda\|W\|_{F}^{2} \tag{1}$$

where $\lambda$ is a regularization parameter and $W\in R^{c\times d}$ is the projection matrix to be learned. $H=[h_{1},h_{2},...,h_{n}]\in R^{c\times n}\ (c\geq 2)$ is the binary label matrix. The $i$ th column of $H$ , i.e., $h_{i}=[0,0,...,0,1,0,...,0]^{T}\in R^{c}$ , is the label vector of sample $x_{i}$ . Suppose $x_{i}$ is from the $j$ th class $(j=1,2,...,c)$ ; then only the $j$ th element of $h_{i}$ is equal to 1 and all the others are 0. Problem (1) has the closed-form solution $\hat{W}=HX^{T}(XX^{T}+\lambda I)^{-1}$ . For a given test sample $y\in R^{d}$ , LSR predicts its label as $l=\arg\max_{i}(\hat{W}y)_{i}$ , where $(\hat{W}y)_{i}$ is the $i$ th entry of $\hat{W}y$ .
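The closed-form solution of problem (1) and the arg-max prediction rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `fit_lsr`, `predict_lsr` and `one_hot` are hypothetical, and a linear solve is used in place of an explicit matrix inverse, which gives the same $\hat{W}$.

```python
import numpy as np

def one_hot(labels, c):
    """Build the (c, n) zero-one label matrix H from integer labels in 0..c-1."""
    H = np.zeros((c, labels.size))
    H[labels, np.arange(labels.size)] = 1.0
    return H

def fit_lsr(X, H, lam):
    """Ridge solution W = H X^T (X X^T + lam I)^{-1}.

    X: (d, n) training samples stored as columns; H: (c, n) label matrix.
    """
    d = X.shape[0]
    # X X^T + lam I is symmetric, so solving for W^T avoids forming the inverse.
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ H.T).T

def predict_lsr(W, X_test):
    """Index of the largest entry of W y for every column y of X_test (d, m)."""
    return np.argmax(W @ X_test, axis=0)
```

Here labels are assumed to be integers starting at 0, so the returned index is the predicted class directly.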

In recent years, researchers developing LSR have focused more on learning relaxed regression targets to replace zero-one labels. For example, Xiang et al. [10] presented a discriminative least squares regression (DLSR) model by utilizing a technique called $\varepsilon$ -dragging. The idea of DLSR is to enlarge the margins between the true and the false classes as much as possible after the original samples are projected into the corresponding label space, which intuitively facilitates classification. Retargeted LSR (ReLSR) [11] directly learns the regression targets from data, which can guarantee that all samples are correctly classified with large margins. Wang et al. [12] proposed a new groupwise ReLSR (GReLSR) model by introducing a groupwise regularization term to encourage within-class samples to have similar translation values.

However, directly minimizing the regression error between the projection features and labels is too restrictive. A single projection matrix is not enough to contain sufficient discriminative information. Besides, both the $\varepsilon$ -dragging and margin constraint techniques can also enlarge the distances between the within-class regression targets. In addition to learning relaxed targets, RLSL [13] proposed to learn a latent feature subspace that can be regarded as an intermediate between the original samples and binary labels. Nevertheless, RLSL did not take into account the structural characteristics of the learned latent subspace.

In this paper, a novel transition subspace learning based LSR (TSL-LSR) model is proposed for multiclass classification. The main advantage of TSL-LSR is the learning of transition subspace which can preserve more underlying structural information in the learned projection. Specifically, the contributions of TSL-LSR can be highlighted as follows

(1) We propose to learn a transition subspace to avoid the problem of over-fitting, which is more flexible than learning projection from samples to zero-one labels directly.

(2) TSL-LSR first transforms the original samples into a transition subspace, then transforms the transition subspace into the space of binary labels. Hence, there are two projection matrices to be learned in the TSL-LSR model and both of these two matrices are used for classification.

(3) To guarantee consistency and global optimum of transformation learning, two projection matrices are learned in a joint framework.

(4) A low-rank constraint is imposed on the transition matrix to capture the underlying feature structures (low-rank structure) of different classes.

(5) The low-rank transition subspace can also be extended to slack-target-based LSR models, where it helps to learn similar and compact within-class regression targets.

II Transition Subspace Learning based Least Squares Regression (TSL-LSR)

II-A The Model of TSL-LSR

Since binary labels already have enough discriminability for classification, TSL-LSR still uses the zero-one labels as the final regression targets. But unlike DLSR, ReLSR and GReLSR, TSL-LSR learns discriminative projections by introducing a low-rank transition subspace to avoid the loss of structural information, rather than relaxing the binary regression targets. The model of TSL-LSR can be formulated as

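$$\min_{W,Q,\Omega}\ \frac{1}{2}\|WX-\Omega\|_{F}^{2}+\alpha\|\Omega\|_{*}+\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\lambda_{1}}{2}\|W\|_{F}^{2}+\frac{\lambda_{2}}{2}\|Q\|_{F}^{2} \tag{2}$$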

where $\alpha$ , $\beta$ , $\lambda_{1}$ and $\lambda_{2}$ are positive regularization parameters. $W\in R^{p\times d}$ , $Q\in R^{c\times p}$ and $\Omega\in R^{p\times n}$ are variables which need to be optimized. $\Omega$ is the transition matrix and $p$ is the dimensionality of transition subspace. $W$ and $Q$ are two projection matrices. $\|\bullet\|_{*}$ is the nuclear norm operator (the sum of matrix singular values) and $\|\Omega\|_{*}$ denotes the low-rank constraint on matrix $\Omega$ .
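To make the five terms of objective (2) concrete, the sketch below evaluates them for given $W$, $Q$ and $\Omega$. The function name `tsl_lsr_objective` is hypothetical; the nuclear norm is computed with NumPy's `np.linalg.norm(..., "nuc")`, which sums the singular values as in the definition above.

```python
import numpy as np

def tsl_lsr_objective(X, H, W, Q, Omega, alpha, beta, lam1, lam2):
    """Value of objective (2).

    Shapes: X (d, n), H (c, n), W (p, d), Q (c, p), Omega (p, n).
    """
    # Fit of the first projection to the transition matrix.
    fit = 0.5 * np.linalg.norm(W @ X - Omega, "fro") ** 2
    # Low-rank term: nuclear norm = sum of singular values of Omega.
    low_rank = alpha * np.linalg.norm(Omega, "nuc")
    # Fit of the second projection to the zero-one labels.
    label_fit = 0.5 * beta * np.linalg.norm(Q @ Omega - H, "fro") ** 2
    # Frobenius-norm regularizers on both projection matrices.
    reg = (0.5 * lam1 * np.linalg.norm(W, "fro") ** 2
           + 0.5 * lam2 * np.linalg.norm(Q, "fro") ** 2)
    return fit + low_rank + label_fit + reg
```

Tracking this value across iterations is one way to produce a convergence curve for an iterative solver.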

The consequence of introducing the transitional transformation space, $\Omega$ , is that TSL-LSR must learn two projection matrices in one model. However, this is more flexible than learning one projection matrix. The first projection matrix, $W$ , is used to transform the original samples into the transition subspace, and the second, $Q$ , is used to transform the transition subspace into the space of binary labels. The reasons for adding a low-rank constraint on transition subspace $\Omega$ can be summarized as follows

(1) The final regression targets, i.e. label matrix $H$ , are low-rank (rank= $c$ ), thus it is reasonable to assume the transition space is also low-rank.

(2) For real-world image classification tasks, images are often collected in realistic conditions, so that they are subject to noise, which has an adverse effect on classification. Thus we assume that the features obtained after the first-step projection, i.e. $WX$ , are heterogeneous. We try to recover a low-rank subspace from the corrupted features based on the assumption that the clean data structures are approximately drawn from a low-rank subspace. As a result, more useful structure information of images can be captured during the transformation learning process. The proposed learning framework (2) is illustrated in Fig. 1. As shown in Fig. 1, we find that the features extracted by our TSL-LSR model include two parts: the first-step features $\Omega$ and the second-step features $Q\Omega$ .

II-B Optimization of TSL-LSR

The objective function in (2) cannot be directly optimized because the variables (i.e., $W$ , $Q$ and $\Omega$ ) are interdependent. Therefore, we use the alternating direction method of multipliers (ADMM) [14] to solve the optimization problem. We first introduce an auxiliary variable $P$ , constrained to equal $\Omega$ , to make problem (2) separable and give its augmented Lagrangian function as

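$$L(W,Q,\Omega,P,Y)=\frac{1}{2}\|WX-\Omega\|_{F}^{2}+\alpha\|P\|_{*}+\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\lambda_{1}}{2}\|W\|_{F}^{2}+\frac{\lambda_{2}}{2}\|Q\|_{F}^{2}+\frac{\mu}{2}\left\|\Omega-P+\frac{Y}{\mu}\right\|_{F}^{2} \tag{3}$$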

where $Y$ is the Lagrangian multiplier and $\mu>0$ is the penalty parameter. Each of the variables $W$ , $Q$ , $\Omega$ and $P$ is updated in turn with the other variables fixed.

Update $W$ : By fixing variables $Q$ , $\Omega$ and $P$ , $W$ can be obtained by minimizing the following problem

$$L(W)=\frac{1}{2}\|WX-\Omega\|_{F}^{2}+\frac{\lambda_{1}}{2}\|W\|_{F}^{2} \tag{4}$$

We set the derivative of $L(W)$ with respect to $W$ to zero, and obtain the following closed-form solution

$$W=\Omega X^{T}(XX^{T}+\lambda_{1}I)^{-1} \tag{5}$$

Update $Q$ : $Q$ can be obtained by minimizing the following problem

$$L(Q)=\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\lambda_{2}}{2}\|Q\|_{F}^{2} \tag{6}$$

which has a closed-form solution as

$$Q=\beta H\Omega^{T}(\beta\Omega\Omega^{T}+\lambda_{2}I)^{-1} \tag{7}$$

Update $\Omega$ : $\Omega$ can be obtained by minimizing the following problem

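$$L(\Omega)=\frac{1}{2}\|WX-\Omega\|_{F}^{2}+\frac{\beta}{2}\|Q\Omega-H\|_{F}^{2}+\frac{\mu}{2}\left\|\Omega-P+\frac{Y}{\mu}\right\|_{F}^{2} \tag{8}$$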

Likewise, $\Omega$ has a closed-form solution

$$\Omega=[(\mu+1)I+\beta Q^{T}Q]^{-1}(WX+\beta Q^{T}H+\mu P-Y) \tag{9}$$
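The closed form for $\Omega$ follows from setting the gradient of $L(\Omega)$ to zero. A short derivation, using only the symbols already defined, is sketched here for completeness:

```latex
\begin{aligned}
\frac{\partial L(\Omega)}{\partial \Omega}
  &= (\Omega - WX) + \beta Q^{T}(Q\Omega - H)
     + \mu\left(\Omega - P + \frac{Y}{\mu}\right) = 0 \\
\Longrightarrow\quad
  \left[(\mu+1)I + \beta Q^{T}Q\right]\Omega
  &= WX + \beta Q^{T}H + \mu P - Y \\
\Longrightarrow\quad
  \Omega &= \left[(\mu+1)I + \beta Q^{T}Q\right]^{-1}
            \left(WX + \beta Q^{T}H + \mu P - Y\right)
\end{aligned}
```

The matrix $(\mu+1)I+\beta Q^{T}Q$ is positive definite for $\mu>0$ and $\beta>0$, so the inverse always exists.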

Update $P$ : $P$ can be obtained by minimizing the following problem

$$L(P)=\alpha\|P\|_{*}+\frac{\mu}{2}\left\|\Omega-P+\frac{Y}{\mu}\right\|_{F}^{2} \tag{10}$$

Formula (10) can be optimized by the singular value thresholding algorithm [15]. The optimal solution of (10) is

$$P=I_{\frac{\alpha}{\mu}}\left(\Omega+\frac{Y}{\mu}\right) \tag{11}$$

where $I_{\zeta}(\Theta)$ is the singular value shrinkage operator. The complete optimization procedures are summarized in Algorithm 1.
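The update steps above can be assembled into a single loop. The following is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: the function names `svt` and `tsl_lsr_admm`, the random initialization of $\Omega$, and the default values of `mu`, `rho`, `mu_max`, `tol` and `max_iter` are all illustrative choices that the text does not specify. The stopping test reads $\|\Omega-P\|_{\infty}$ as the largest absolute entry, which is also an assumption.

```python
import numpy as np

def svt(M, tau):
    """Singular value shrinkage: soft-threshold the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def tsl_lsr_admm(X, H, p, alpha=0.01, beta=0.1, lam1=0.01, lam2=0.01,
                 mu=1e-2, rho=1.1, mu_max=1e6, tol=1e-6, max_iter=200, seed=0):
    """ADMM sketch for problem (2). X: (d, n), H: (c, n). Returns W, Q, Omega."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    # Assumed initialization: a zero Omega would leave every variable at zero.
    Omega = rng.standard_normal((p, n))
    P = Omega.copy()
    Y = np.zeros((p, n))          # Lagrangian multiplier
    # Pre-compute X^T (X X^T + lam1 I)^{-1}; it is reused by every W update.
    XtG = np.linalg.solve(X @ X.T + lam1 * np.eye(d), X).T   # (n, d)
    for _ in range(max_iter):
        W = Omega @ XtG                                        # Eq. (5)
        B = beta * Omega @ Omega.T + lam2 * np.eye(p)
        Q = np.linalg.solve(B, beta * Omega @ H.T).T           # Eq. (7)
        A = (mu + 1.0) * np.eye(p) + beta * Q.T @ Q
        Omega = np.linalg.solve(A, W @ X + beta * Q.T @ H + mu * P - Y)  # Eq. (9)
        P = svt(Omega + Y / mu, alpha / mu)                    # Eq. (11)
        Y = Y + mu * (Omega - P)                               # Eq. (12)
        mu = min(mu_max, rho * mu)                             # penalty update
        if np.max(np.abs(Omega - P)) <= tol:                   # convergence check
            break
    return W, Q, Omega
```

Linear solves replace the explicit inverses in Eq. (5), (7) and (9); both system matrices are symmetric positive definite, so the results coincide with the closed forms.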

Next, we analyze the computational complexity of Algorithm 1. Following [16], the main time-consuming steps of Algorithm 1 are

(1) Matrix inverse in Eq. (5), (7), and (9).

(2) Singular value decomposition in Eq. (11).

The complexity of pre-computing $X^{T}(XX^{T}+\lambda_{1}I)^{-1}$ in Eq. (5) is $O(d^{3})$ . The matrices inverted in Eq. (7) and Eq. (9), $(\beta\Omega\Omega^{T}+\lambda_{2}I)^{-1}$ and $[(\mu+1)I+\beta Q^{T}Q]^{-1}$ , are both of size $p\times p$ , so each inverse costs $O(p^{3})$ , which equals $O(c^{3})$ under the setting $p=c$ used in our experiments. The complexity of singular value decomposition in Eq. (11) is $O(n^{3})$ . Thus the final time complexity for Algorithm 1 is about $O(d^{3}+\tau(c^{3}+n^{3}))$ , where $\tau$ is the number of iterations.

II-C Classification

Once the optimal projection matrices $W$ and $Q$ are obtained, we can use them to classify test samples. Given a new test sample $y\in R^{d}$ , its regression is $QWy$ . Then, the nearest-neighbor (NN) classifier is used to predict the label of $y$ .
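A minimal sketch of this classification step is given below. It assumes that the NN classifier compares the regression result $QWy$ of a test sample with the regression results $QWx_{i}$ of the training samples under Euclidean distance; the text does not state the distance or the reference set, and the function name `nn_classify` is hypothetical.

```python
import numpy as np

def nn_classify(W, Q, X_train, train_labels, X_test):
    """1-NN prediction on the regression features Q W x.

    X_train: (d, n), X_test: (d, m), train_labels: (n,) integer array.
    """
    F_train = Q @ W @ X_train          # (c, n) features of training samples
    F_test = Q @ W @ X_test            # (c, m) features of test samples
    # Squared Euclidean distance between every test/training feature pair.
    d2 = (np.sum(F_test ** 2, axis=0)[:, None]
          - 2.0 * F_test.T @ F_train
          + np.sum(F_train ** 2, axis=0)[None, :])
    # Each test sample takes the label of its nearest training feature.
    return train_labels[np.argmin(d2, axis=1)]
```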

III Experiments

We compare the proposed TSL-LSR model with four recent LSR-based classification methods, namely DLSR [10], ReLSR [11], GReLSR [12] and RLSL [13], and three representation based classification methods, namely LRC [7], CRC [8] and ProCRC [9], on a range of different datasets. For TSL-LSR, DLSR, ReLSR, GReLSR and RLSL, we use the NN classifier. The datasets are of two types: (1) Face: the AR [17], CMU PIE [18] and Feret [19] datasets; (2) Object: the COIL-20 [20] dataset. For each dataset, we randomly select several images of each class for training, and the remaining images are used for testing. We repeat all the experiments ten times and report the mean classification results (mean $\pm$ std). A brief description of these datasets is given in Table I.
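The evaluation protocol described above (random per-class training selection, ten repetitions, mean and standard deviation of accuracy) can be sketched as follows. The helper names `per_class_split` and `repeated_evaluation` and the `fit_predict` callback are hypothetical, and NumPy's default population standard deviation is used because the text does not say which estimator was reported.

```python
import numpy as np

def per_class_split(labels, n_train_per_class, rng):
    """Randomly pick n_train_per_class indices per class; the rest are test."""
    train_idx = []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        train_idx.extend(idx[:n_train_per_class])
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

def repeated_evaluation(X, labels, n_train_per_class, fit_predict,
                        n_repeats=10, seed=0):
    """Mean and std of accuracy (%) over repeated random splits.

    X: (d, n) samples as columns.
    fit_predict(X_train, y_train, X_test) must return predicted labels.
    """
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        tr, te = per_class_split(labels, n_train_per_class, rng)
        pred = fit_predict(X[:, tr], labels[tr], X[:, te])
        accs.append(100.0 * np.mean(pred == labels[te]))
    return float(np.mean(accs)), float(np.std(accs))
```

Any classifier can be plugged in through `fit_predict`, so the same splits can be reused to compare methods.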

III-A Classification results on different datasets

We first need to determine the value of $p$ , the row dimensionality of the transition matrix $\Omega$ . Tuning it is difficult because $p$ can in principle be any positive integer. From [21], we know that $p$ can be set to around $c$ , where $c$ is the number of classes. Fig. 2 presents the classification accuracies (%) versus the value of $p$ on two face datasets. The accuracy changes little once $p>c$ , and the peak is achieved when $p$ is approximately equal to $c$ . Therefore, in our experiments, we fix $p=c$ on all datasets.

The comparative classification results on the four datasets are shown in Table II. Our TSL-LSR model consistently achieves better accuracies than the other algorithms, including the two most recent ones, GReLSR and RLSL. This is mainly because DLSR, ReLSR and GReLSR all focus on learning slack regression targets without guarding against the problem of over-fitting. In contrast, TSL-LSR introduces a low-rank transition subspace to alleviate the structural information loss caused by restrictive matrix projection. Its two learned projection matrices have a greater capacity to capture the discriminative information conveyed by the data during projection learning. To further validate whether the two projections learned by the TSL-LSR model can capture discriminative features from the original samples, we use the t-SNE algorithm [22] to visualize the distribution of the extracted features. From Fig. 3, we can see that TSL-LSR places all the samples in their own subspaces and that the distribution of intra-class samples is very compact, which indicates that the extracted features exhibit strong inter-class separability and intra-class compactness. This also demonstrates that the transition subspace learning is beneficial for classification.

III-B Convergence Validation

Based on the optimization procedures in Section II-B, it is easy to prove that the proposed TSL-LSR model is convex with respect to each variable. In this section, we validate the convergence of Algorithm 1 on two datasets. The convergence results are shown in Fig. 4. We can see that Algorithm 1 converges very well, with the value of the objective function of TSL-LSR decreasing monotonically as the number of iterations increases. This confirms the effectiveness of the adopted optimization algorithm.

III-C Parameter Sensitivity

In this section, we test the parameter sensitivity of TSL-LSR. TSL-LSR has four parameters to be tuned in our experiments. The parameters $\lambda_{1}$ and $\lambda_{2}$ are both set to 0.01, so we focus on selecting the values of parameters $\alpha$ and $\beta$ from the candidate set $\{0.001,0.005,0.01,0.05,0.1,0.5,1\}$ . The classification accuracy as a function of the parameter values on the four datasets is shown in Fig. 5. It is apparent that the classification accuracy of TSL-LSR is not very sensitive to the values of $\alpha$ and $\beta$ .

IV Conclusion

In this paper, an effective transition subspace learning based least squares regression model (TSL-LSR) is proposed for multicategory image classification. Different from traditional LSR based regression models, which directly learn projection from original samples to corresponding label subspace, TSL-LSR tries to learn a low-rank transition subspace to avoid the problem of overfitting caused by restrictive projection learning. Moreover, TSL-LSR imposes a low-rank constraint on the transition matrix to learn more underlying structures of data. Two discriminative projection matrices are learned for classification. Extensive experiments demonstrate the effectiveness of the proposed method.

Bibliography

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. B (Methodol.), vol. 58, no. 1, pp. 267-288, 1996.
[2] S. Wold, H. Ruhe, H. Wold, and W. Dunn, "The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses," J. Sci. Stat. Comput., vol. 5, no. 3, pp. 735-743, Jan. 1984.
[3] L. Jiao, L. Bo, and L. Wang, "Fast sparse approximation for least squares support vector machine," IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 685-697, May 2007.
[4] S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Minneapolis, MN, USA, pp. 1-8, Jun. 2007.
[5] T. Strutz, "Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond," Wiesbaden, Germany: Vieweg, 2010.
[6] J. Wright, A. Y. Yang, A. Ganesh, et al., "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210-227, 2009.
[7] I. Naseem, R. Togneri, and M. Bennamoun, "Linear regression for face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 11, pp. 2106-2112, 2010.
[8] L. Zhang, M. Yang, and X. Feng, "Sparse representation or collaborative representation: Which helps face recognition?" in Proc. of IEEE Int. Conf. Comput. Vis., pp. 471-478, 2011.