An EM Based Probabilistic Two-Dimensional CCA with Application to Face Recognition

Mehran Safayani; Seyed Hashem Ahmadi; Homayun Afrabandpey; Abdolreza Mirzaei

arXiv:1702.07884·cs.CV·August 7, 2017

TL;DR

This paper introduces a probabilistic framework for 2DCCA, called P2DCCA, using an EM algorithm, which improves face recognition performance under various challenging conditions.

Contribution

It develops a probabilistic formulation of 2DCCA together with an EM-based algorithm for estimating its parameters, and applies it to face recognition.

Findings

01. Superior loading factor estimation over 2DCCA

02. Robust face recognition across different conditions

03. Effective on synthetic and real datasets

Abstract

Recently, two-dimensional canonical correlation analysis (2DCCA) has been successfully applied to image feature extraction. Instead of concatenating the columns of each image into a one-dimensional vector, the method works directly with two-dimensional image matrices. Although 2DCCA works well in different recognition tasks, it lacks a probabilistic interpretation. In this paper, we present a probabilistic framework for 2DCCA, called probabilistic 2DCCA (P2DCCA), and an iterative EM-based algorithm for optimizing its parameters. Experimental results on synthetic and real data demonstrate superior performance in loading factor estimation for P2DCCA compared to 2DCCA. For real data, three subsets of the AR face database and the UMIST face database confirm the robustness of the proposed algorithm in face recognition tasks with different illumination conditions, facial expressions, poses…

Tables

Table 1: Comparison of the average recognition accuracy rates of the nine evaluated algorithms on AR-1 (%)

Table 2: Comparison of the average recognition accuracy rates of the nine evaluated algorithms on AR-2 (%)

Table 3: Comparison of the average recognition accuracy rates of the nine evaluated algorithms on AR-3 (%)

Table 4: Comparison of the average recognition accuracy rates of the nine evaluated algorithms on UMIST (%)

Table 5: The $\rho$-value associated with the null hypothesis: “no significant difference between recognition rates of P2DCCA and the corresponding algorithm”

Data Sets \Algorithms	CCA	PCCA	2DCCA
AR-1	1.9e-06	9.5e-07	8.6e-10
AR-2	6.1e-12	5.7e-12	7.5e-05
AR-3	2.5e-05	4.8e-05	5.2e-06
UMIST	6.1e-05	3.5e-05	5.7e-03

Table 6: Time complexity of algorithms

Equations

$$\rho=\frac{\mathrm{cov}(w_{1}^{T}t_{1},w_{2}^{T}t_{2})}{\sqrt{\mathrm{var}(w_{1}^{T}t_{1})\,\mathrm{var}(w_{2}^{T}t_{2})}}=\frac{w_{1}^{T}\Sigma_{12}w_{2}}{\sqrt{(w_{1}^{T}\Sigma_{11}w_{1})(w_{2}^{T}\Sigma_{22}w_{2})}}$$

$$\underset{w_{1},w_{2}}{\arg\max}\;w_{1}^{T}\Sigma_{12}w_{2}\quad\text{s.t.}\quad w_{1}^{T}\Sigma_{11}w_{1}=1,\quad w_{2}^{T}\Sigma_{22}w_{2}=1$$

$$\begin{pmatrix}0&\Sigma_{12}\\ \Sigma_{21}&0\end{pmatrix}\begin{pmatrix}w_{1}\\ w_{2}\end{pmatrix}=\lambda\begin{pmatrix}\Sigma_{11}&0\\ 0&\Sigma_{22}\end{pmatrix}\begin{pmatrix}w_{1}\\ w_{2}\end{pmatrix}$$

$$t_{i}=W_{i}z+\mu_{i}+\epsilon_{i},\quad i\in\{1,2\}$$

$$W_{i}=\Sigma_{ii}U_{id}M_{i},\quad i\in\{1,2\}$$

$$\underset{u_{1},u_{2},v_{1},v_{2}}{\arg\max}\;\mathrm{cov}(u_{1}^{T}T_{1}v_{1},u_{2}^{T}T_{2}v_{2})\quad\text{s.t.}\quad\mathrm{var}(u_{1}^{T}T_{1}v_{1})=1,\quad\mathrm{var}(u_{2}^{T}T_{2}v_{2})=1$$

$$\begin{pmatrix}0&\Sigma_{12}^{r}\\ \Sigma_{21}^{r}&0\end{pmatrix}\begin{pmatrix}u_{1}\\ u_{2}\end{pmatrix}=\lambda\begin{pmatrix}\Sigma_{11}^{r}&0\\ 0&\Sigma_{22}^{r}\end{pmatrix}\begin{pmatrix}u_{1}\\ u_{2}\end{pmatrix}$$

$$\begin{pmatrix}0&\Sigma_{12}^{l}\\ \Sigma_{21}^{l}&0\end{pmatrix}\begin{pmatrix}v_{1}\\ v_{2}\end{pmatrix}=\lambda\begin{pmatrix}\Sigma_{11}^{l}&0\\ 0&\Sigma_{22}^{l}\end{pmatrix}\begin{pmatrix}v_{1}\\ v_{2}\end{pmatrix}$$

$$\Sigma_{ij}^{r}=\frac{1}{N}\sum_{n=1}^{N}(T_{i,n}-\mu_{i})v_{i}v_{j}^{T}(T_{j,n}-\mu_{j})^{T},\quad i,j\in\{1,2\}$$

$$\Sigma_{ij}^{l}=\frac{1}{N}\sum_{n=1}^{N}(T_{i,n}-\mu_{i})^{T}u_{i}u_{j}^{T}(T_{j,n}-\mu_{j}),\quad i,j\in\{1,2\}$$

$$T_{i}=U_{i}ZV_{i}^{T}+\mu_{i}+\Xi_{i},\quad i\in\{1,2\}$$

$$L_{c}=\sum_{n=1}^{N}\log p(T_{1,n},T_{2,n},Z_{n})$$

$$T_{i}^{l}=U_{i}Z^{l}+\Xi_{i}^{l},\quad i=1,2$$

$$T_{i}^{r}=V_{i}Z^{r}+\Xi_{i}^{r},\quad i=1,2$$

$$p(T_{1},T_{2},Z)\propto p(T_{1}^{l},T_{2}^{l},Z^{l})\,p(T_{1}^{r},T_{2}^{r},Z^{r})$$

$$L_{c}=\sum_{n=1}^{N}\log\left(p(T_{1,n}^{l},T_{2,n}^{l},Z_{n}^{l})\,p(T_{1,n}^{r},T_{2,n}^{r},Z_{n}^{r})\right)=\sum_{n=1}^{N}\log p(T_{1,n}^{l},T_{2,n}^{l},Z_{n}^{l})+\sum_{n=1}^{N}\log p(T_{1,n}^{r},T_{2,n}^{r},Z_{n}^{r})$$

$$p(T_{i}^{l})=\prod_{j=1}^{n'}p(t_{i,j}^{l}),\quad i\in\{1,2\}$$

$$p(t_{i,j}^{l})\sim N(\mu_{i,j}^{l},U_{i}U_{i}^{T}+\Psi_{i}^{l}),\quad i=1,2$$

$$p(\tau_{j}^{l})\sim N(m_{j}^{l},\Sigma^{l}),\quad j\in[1,n']$$

$$p(T_{1,n}^{l},T_{2,n}^{l},Z_{n}^{l})=\prod_{j=1}^{n'}p(\tau_{n,j}^{l},z_{n,j}^{l})=\prod_{j=1}^{n'}p(\tau_{n,j}^{l}\mid z_{n,j}^{l})\,p(z_{n,j}^{l})$$

$$U_{t+1}=\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}\left[(M_{t}^{l})^{-1}+(M_{t}^{l})^{-1}U_{t}^{T}(\Psi_{t}^{l})^{-1}\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}\right]^{-1}$$

$$\Psi_{t+1}^{l}=\begin{pmatrix}\left(\widetilde{\Sigma}^{l}-\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}U_{t+1}^{T}\right)_{11}&0\\ 0&\left(\widetilde{\Sigma}^{l}-\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}U_{t+1}^{T}\right)_{22}\end{pmatrix}$$

$$\widetilde{\Sigma}^{l}=\frac{1}{N}\sum_{n=1}^{N}\left[(T_{1,n}^{l}-\mu_{1}^{l})^{T}\;(T_{2,n}^{l}-\mu_{2}^{l})^{T}\right]^{T}\left[(T_{1,n}^{l}-\mu_{1}^{l})^{T}\;(T_{2,n}^{l}-\mu_{2}^{l})^{T}\right]$$

$$p(T_{i}^{r})=\prod_{j=1}^{m'}p(t_{i,j}^{r}),\quad i\in\{1,2\}$$

$$p(t_{i,j}^{r})\sim N(\mu_{i,j}^{r},V_{i}V_{i}^{T}+\Psi_{i}^{r}),\quad i\in\{1,2\}$$

$$p(\tau_{j}^{r})\sim N(m_{j}^{r},\Sigma^{r}),\quad j\in[1,m']$$

$$p(T_{1,n}^{r},T_{2,n}^{r},Z_{n}^{r})=\prod_{j=1}^{m'}p(\tau_{n,j}^{r},z_{n,j}^{r})=\prod_{j=1}^{m'}p(\tau_{n,j}^{r}\mid z_{n,j}^{r})\,p(z_{n,j}^{r})$$

$$V_{t+1}=\widetilde{\Sigma}^{r}(\Psi_{t}^{r})^{-1}V_{t}(M_{t}^{r})^{-1}\left[(M_{t}^{r})^{-1}+(M_{t}^{r})^{-1}V_{t}^{T}(\Psi_{t}^{r})^{-1}\widetilde{\Sigma}^{r}(\Psi_{t}^{r})^{-1}V_{t}(M_{t}^{r})^{-1}\right]^{-1}$$

Full text

Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran

Tel.: +98 31 33919063

Fax: +98 31 33912451

Email: [email protected] (Corresponding Author)

Abstract

Recently, two-dimensional canonical correlation analysis (2DCCA) has been successfully applied to image feature extraction. Instead of concatenating the columns of each image into a one-dimensional vector, the method works directly with two-dimensional image matrices. Although 2DCCA works well in different recognition tasks, it lacks a probabilistic interpretation. In this paper, we present a probabilistic framework for 2DCCA, called probabilistic 2DCCA (P2DCCA), and an iterative EM-based algorithm for optimizing its parameters. Experimental results on synthetic and real data demonstrate superior performance in loading factor estimation for P2DCCA compared to 2DCCA. For real data, three subsets of the AR face database and the UMIST face database confirm the robustness of the proposed algorithm in face recognition tasks with different illumination conditions, facial expressions, poses and occlusions.

Keywords: Canonical Correlation Analysis (CCA); Two-dimensional CCA; Probabilistic Feature extraction; Dimension Reduction; Face recognition

1 Introduction

Although many real-world applications encounter high dimensional data, the most informative part of the data can be modeled in a low dimensional space. Moreover, processing high-dimensional data is a time consuming process and requires lots of resources. To tackle these problems, feature extraction has been used as a tool for finding a compact and meaningful data representation.

For single-mode source data, subspace learning methods are used to learn more semantic descriptive subspaces. Examples of these methods are principal component analysis (PCA) jolliffe2002principal and linear discriminant analysis (LDA). However, for observations from two sources that share some mutual information, canonical correlation analysis (CCA) hotelling1936relations is a very popular approach for dimensionality reduction. CCA seeks a lower-dimensional space in which two sets of variables are maximally correlated after projection. This technique is widely used in pattern recognition, computer vision, bioinformatics and other fields jia2012incremental ; wang2014inferring ; huang2014retracted . CCA-based methods require 2D image matrices to be vectorized. Vectorization has three main drawbacks: (I) it breaks the spatial structure of image data, which may lose potentially useful structural information among columns/rows ye2004gpca , (II) it leads to a high-dimensional vector space and the small sample size problem, which in turn makes it difficult to calculate the covariance matrices zhang20052d and (III) it makes the covariance matrices very large, so that their eigen-decomposition becomes very time-consuming.

To overcome these drawbacks, in 2007 two-dimensional CCA (2DCCA) was introduced by Lee and Choi lee2007two which computes CCA directions based on 2D image matrices. The proposed 2DCCA overcomes the curse of dimensionality and significantly reduces the computational cost, by directly working with 2D images instead of reshaping them into 1D vectors. In lee2007two , higher recognition accuracies were reported using 2DCCA compared to CCA using two face databases and the time complexity has been improved.

However, an associated probabilistic model for the observed data was notably absent from these feature extraction methods. A probabilistic feature extraction algorithm is appealing for many reasons tipping1999probabilistic . To bridge this gap, in 1999 Tipping and Bishop proposed probabilistic PCA (PPCA) tipping1999probabilistic based on a latent variable model known as factor analysis (FA) thompson2004exploratory ; browne1979maximum . PPCA was then used as a framework for many other formulations of PCA klami2008probabilistic ; archambeau2006robust ; archambeau2009sparse ; zhao2012bilinear . Probabilistic models have also been proposed for LDA ioffe2006probabilistic ; kaski2003informative . In 2005, Bach and Jordan bach2005probabilistic proposed a probabilistic interpretation of CCA and estimated the parameters of their model using both maximum likelihood and expectation maximization (EM). Recently, much research has been carried out in the 1D CCA domain, including kernel-based, semiparametric and nonparametric methods sarvestani2016ff ; podosinnikova2016beyond ; michaeli2016nonparametric , but we feel that more work is required in the 2D CCA domain. A probabilistic model of 2DCCA was introduced by Safayani et al. in 2011 safayani2011matrix . They showed that maximum likelihood estimation of the parameters leads to the two-dimensional canonical correlation directions. However, they did not propose an EM-based solution for their model. EM does not require the explicit eigen-decomposition of covariance matrices. Moreover, with EM it is possible to handle models with incomplete data, such as mixture models where the cluster labels are the missing values wang2008probabilistic .

In this paper, we present a probabilistic interpretation of 2DCCA, referred to as P2DCCA, together with an EM based solution to estimate the parameters of the model. The proposed model can handle the small sample size problem effectively.

The rest of the paper is organized as follows: Section 2 briefly reviews related algorithms such as CCA, PCCA and 2DCCA, which are necessary to understand how the proposed algorithm works. The proposed P2DCCA model is introduced in Section 3. In Section 4, experiments on synthetic data and several face databases evaluate the performance of the proposed algorithm; finally, the paper is concluded in Section 5.

2 Background

2.1 Canonical Correlation Analysis (CCA)

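Suppose we are given two random vectors $t_{1}$ and $t_{2}$, where $t_{1,n}\in\mathbb{R}^{D_{1}}$ and $t_{2,n}\in\mathbb{R}^{D_{2}}$ for $n\in\{1,2,\dots,N\}$ are realizations of the corresponding random vectors. CCA seeks transformation vectors $w_{1}\in\mathbb{R}^{D_{1}}$ and $w_{2}\in\mathbb{R}^{D_{2}}$ such that the correlation between $w_{1}^{T}t_{1}$ and $w_{2}^{T}t_{2}$ is maximized. This correlation can be formulated as

$$\rho=\frac{\mathrm{cov}(w_{1}^{T}t_{1},w_{2}^{T}t_{2})}{\sqrt{\mathrm{var}(w_{1}^{T}t_{1})\,\mathrm{var}(w_{2}^{T}t_{2})}}=\frac{w_{1}^{T}\Sigma_{12}w_{2}}{\sqrt{(w_{1}^{T}\Sigma_{11}w_{1})(w_{2}^{T}\Sigma_{22}w_{2})}}$$

where $\Sigma_{ij}=\frac{1}{N}\sum_{n=1}^{N}(t_{i,n}-\mu_{i})(t_{j,n}-\mu_{j})^{T}$ for $i,j\in\{1,2\}$ is the covariance matrix of $t_{i}$ when $i=j$ and the cross-covariance matrix of $t_{1}$ and $t_{2}$ when $i\neq j$, and $\mu_{i}=\frac{1}{N}\sum_{n=1}^{N}t_{i,n}$ for $i\in\{1,2\}$ denotes the mean vector of $t_{i}$.

Then, the objective function for CCA can be written as:

$$\underset{w_{1},w_{2}}{\arg\max}\;w_{1}^{T}\Sigma_{12}w_{2}\quad\text{s.t.}\quad w_{1}^{T}\Sigma_{11}w_{1}=1,\quad w_{2}^{T}\Sigma_{22}w_{2}=1$$

Optimizing this constrained maximization problem with respect to $w_{1}$ and $w_{2}$ leads to the following generalized eigenvalue problem:

$$\begin{pmatrix}0&\Sigma_{12}\\ \Sigma_{21}&0\end{pmatrix}\begin{pmatrix}w_{1}\\ w_{2}\end{pmatrix}=\lambda\begin{pmatrix}\Sigma_{11}&0\\ 0&\Sigma_{22}\end{pmatrix}\begin{pmatrix}w_{1}\\ w_{2}\end{pmatrix}$$

By solving equation (3), the $w_{1}$ and $w_{2}$ that maximize the correlation between the projected data can be found.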
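As an illustration of the optimization above, the following minimal NumPy sketch computes the first canonical pair. It does not solve the generalized eigenvalue problem directly; instead it uses the equivalent route of whitening each view with a Cholesky factor and taking an SVD of the whitened cross-covariance, whose largest singular value equals the largest $\lambda$. The function name `cca_first_pair`, the ridge term `reg` and the toy data are assumptions made for this example, not part of the paper.

```python
import numpy as np

def cca_first_pair(t1, t2, reg=1e-8):
    """First canonical pair for data of shape (N, D1) and (N, D2)."""
    n = t1.shape[0]
    c1 = t1 - t1.mean(axis=0)
    c2 = t2 - t2.mean(axis=0)
    # Sample (cross-)covariances; `reg` is a small ridge for numerical
    # stability (an assumption of this sketch, not taken from the paper).
    s11 = c1.T @ c1 / n + reg * np.eye(c1.shape[1])
    s22 = c2.T @ c2 / n + reg * np.eye(c2.shape[1])
    s12 = c1.T @ c2 / n
    l1 = np.linalg.cholesky(s11)  # s11 = l1 @ l1.T
    l2 = np.linalg.cholesky(s22)
    # Whitened cross-covariance K = l1^{-1} s12 l2^{-T}.
    k = np.linalg.solve(l2, np.linalg.solve(l1, s12).T).T
    a, s, bt = np.linalg.svd(k)
    # Map the leading singular vectors back so that w_i^T s_ii w_i = 1.
    w1 = np.linalg.solve(l1.T, a[:, 0])
    w2 = np.linalg.solve(l2.T, bt[0])
    return w1, w2, s[0]

# Toy data: two noisy views of one shared latent variable.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
t1 = z @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(500, 4))
t2 = z @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(500, 3))
w1, w2, rho = cca_first_pair(t1, t2)
```

The returned `rho` should agree with the empirical correlation of `t1 @ w1` and `t2 @ w2`, up to the effect of the small ridge term.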

2.2 Probabilistic CCA

The generative model introduced by Bach and Jordan for CCA is as follows:

$$t_{i}=W_{i}z+\mu_{i}+\epsilon_{i},\quad i\in\{1,2\}$$

In this model, $W_{i}\in\mathbb{R}^{D_{i}\times d}$, $i\in\{1,2\}$, are linear maps that relate a lower-dimensional latent vector $z\in\mathbb{R}^{d}$ to the two high-dimensional observed random vectors $t_{i}\in\mathbb{R}^{D_{i}}$, $i\in\{1,2\}$. Here $\mu_{i}$ is the mean vector of $t_{i}$ and $\epsilon_{i}$ is the error term, which is assumed to follow a multivariate Gaussian distribution with zero mean and covariance matrix $\Psi_{i}$. Bach and Jordan proved that maximum likelihood estimation of the parameters of this model leads to the canonical directions. The maximum likelihood estimates of the projection matrices are given by:

$$W_{i}=\Sigma_{ii}U_{id}M_{i},\quad i\in\{1,2\}$$

where $\Sigma_{ii}$ is the sample covariance matrix, $U_{id}$ contains the first $d$ canonical directions and $M_{i}\in\mathbb{R}^{d\times d}$, $i\in\{1,2\}$, are arbitrary matrices such that $M_{1}M_{2}^{T}=P_{d}$, where $P_{d}$ is the diagonal matrix of the first $d$ canonical correlations. Figure 1 shows a graphical representation of the model.

2.3 Two-Dimensional CCA

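Two-dimensional CCA (2DCCA) was proposed to avoid vectorizing the data in CCA. For the random matrices $T_{1}$ and $T_{2}$, 2DCCA introduces left transforms $u_{i}$ and right transforms $v_{i}$, where $i\in\{1,2\}$. After projection, the data have the form $u_{i}^{T}T_{i}v_{i}$. 2DCCA finds the left and right transforms that maximize the correlation between the projected data. Therefore, the objective function of 2DCCA can be formulated as:

$$\underset{u_{1},u_{2},v_{1},v_{2}}{\arg\max}\;\mathrm{cov}(u_{1}^{T}T_{1}v_{1},u_{2}^{T}T_{2}v_{2})\quad\text{s.t.}\quad\mathrm{var}(u_{1}^{T}T_{1}v_{1})=1,\quad\mathrm{var}(u_{2}^{T}T_{2}v_{2})=1$$

$u_{1}$ and $u_{2}$ can be obtained by solving the generalized eigenvalue problem (7) with fixed $v_{1}$ and $v_{2}$:

$$\begin{pmatrix}0&\Sigma_{12}^{r}\\ \Sigma_{21}^{r}&0\end{pmatrix}\begin{pmatrix}u_{1}\\ u_{2}\end{pmatrix}=\lambda\begin{pmatrix}\Sigma_{11}^{r}&0\\ 0&\Sigma_{22}^{r}\end{pmatrix}\begin{pmatrix}u_{1}\\ u_{2}\end{pmatrix}$$

In a similar way, given $u_{1}$ and $u_{2}$, the right transforms $v_{1}$ and $v_{2}$ can be found by solving

$$\begin{pmatrix}0&\Sigma_{12}^{l}\\ \Sigma_{21}^{l}&0\end{pmatrix}\begin{pmatrix}v_{1}\\ v_{2}\end{pmatrix}=\lambda\begin{pmatrix}\Sigma_{11}^{l}&0\\ 0&\Sigma_{22}^{l}\end{pmatrix}\begin{pmatrix}v_{1}\\ v_{2}\end{pmatrix}$$

where $\Sigma_{ij}^{r}$, $i,j\in\{1,2\}$, $i\neq j$, is the cross-covariance matrix between $T_{i}$ and $T_{j}$ and $\Sigma_{ii}^{r}$, $i\in\{1,2\}$, is the auto-covariance matrix of $T_{i}$, both computed with the right transforms fixed; $\Sigma_{ij}^{l}$ is defined analogously with the left transforms fixed:

$$\Sigma_{ij}^{r}=\frac{1}{N}\sum_{n=1}^{N}(T_{i,n}-\mu_{i})v_{i}v_{j}^{T}(T_{j,n}-\mu_{j})^{T},\quad i,j\in\{1,2\}$$

$$\Sigma_{ij}^{l}=\frac{1}{N}\sum_{n=1}^{N}(T_{i,n}-\mu_{i})^{T}u_{i}u_{j}^{T}(T_{j,n}-\mu_{j}),\quad i,j\in\{1,2\}$$

where $T_{i,n}$ for $n\in\{1,...,N\}$ is a realization of the random matrix $T_{i}$ and $\mu_{i}=\frac{1}{N}\sum_{n=1}^{N}T_{i,n}$ is the corresponding matrix of mean values.

The left transforms ($u_{1}$ and $u_{2}$) and right transforms ($v_{1}$ and $v_{2}$) are obtained by iteratively solving equations (7) and (8) until convergence. The eigenvectors associated with the $d_{1}$ largest eigenvalues in (7) determine the left transform matrices $U_{1}$ and $U_{2}$, and the eigenvectors associated with the $d_{2}$ largest eigenvalues in (8) determine the right transform matrices $V_{1}$ and $V_{2}$. Using these transform matrices, it is possible to project data from a high-dimensional space to a new lower-dimensional feature space.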
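The alternating procedure described above can be sketched as follows for a single pair of left and right transforms ($d_{1}=d_{2}=1$). This is a hedged illustration, not the reference implementation: each generalized eigenvalue problem is solved through an equivalent whitening-plus-SVD step, a fixed number of sweeps replaces a convergence test, and the names `top_pair`, `two_d_cca`, `n_iter` and the toy data are assumptions introduced here.

```python
import numpy as np

def top_pair(s11, s22, s12, reg=1e-8):
    """Leading solution of the CCA problem defined by (s11, s22, s12)."""
    l1 = np.linalg.cholesky(s11 + reg * np.eye(len(s11)))
    l2 = np.linalg.cholesky(s22 + reg * np.eye(len(s22)))
    k = np.linalg.solve(l2, np.linalg.solve(l1, s12).T).T
    a, s, bt = np.linalg.svd(k)
    return np.linalg.solve(l1.T, a[:, 0]), np.linalg.solve(l2.T, bt[0]), s[0]

def two_d_cca(T1, T2, n_iter=20):
    """T1: (N, m1, n1), T2: (N, m2, n2). Returns u1, u2, v1, v2, rho."""
    N = T1.shape[0]
    C1, C2 = T1 - T1.mean(axis=0), T2 - T2.mean(axis=0)
    # Arbitrary starting right transforms (an assumption of this sketch).
    v1 = np.ones(T1.shape[2]) / np.sqrt(T1.shape[2])
    v2 = np.ones(T2.shape[2]) / np.sqrt(T2.shape[2])
    for _ in range(n_iter):
        # Fix v: a_i[n] = (T_{i,n} - mu_i) v_i, so Sigma^r_ij = a_i^T a_j / N.
        a1, a2 = C1 @ v1, C2 @ v2
        u1, u2, _ = top_pair(a1.T @ a1 / N, a2.T @ a2 / N, a1.T @ a2 / N)
        # Fix u: b_i[n] = (T_{i,n} - mu_i)^T u_i, so Sigma^l_ij = b_i^T b_j / N.
        b1 = np.einsum("nij,i->nj", C1, u1)
        b2 = np.einsum("nij,i->nj", C2, u2)
        v1, v2, rho = top_pair(b1.T @ b1 / N, b2.T @ b2 / N, b1.T @ b2 / N)
    return u1, u2, v1, v2, rho

# Toy data: two 5x5 views driven by one shared scalar latent variable.
rng = np.random.default_rng(1)
z = rng.normal(size=300)
T1 = z[:, None, None] * np.outer(rng.uniform(size=5), rng.uniform(size=5))
T1 = T1 + 0.1 * rng.normal(size=(300, 5, 5))
T2 = z[:, None, None] * np.outer(rng.uniform(size=5), rng.uniform(size=5))
T2 = T2 + 0.1 * rng.normal(size=(300, 5, 5))
u1, u2, v1, v2, rho = two_d_cca(T1, T2)
```

Extending the sketch to $d_{1},d_{2}>1$ would keep several singular vectors per step instead of only the leading one.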

3 Probabilistic Two Dimensional CCA (P2DCCA)

In this section, we propose probabilistic two-dimensional CCA and an EM-based solution for finding the parameters of the model. In our model, observed data are modeled as two-dimensional matrices as follows:

$$T_{i}=U_{i}ZV_{i}^{T}+\mu_{i}+\Xi_{i},\quad i\in\{1,2\}$$

where $T_{i}\in\mathbb{R}^{m_{i}\times n_{i}}$ for $i\in\{1,2\}$ are the observed matrices and $Z\in\mathbb{R}^{m'\times n'}$ is the latent matrix. $U_{i}\in\mathbb{R}^{m_{i}\times m'}$ and $V_{i}\in\mathbb{R}^{n_{i}\times n'}$ are projection matrices, $\mu_{i}$ is the mean matrix of the observed data and $\Xi_{i}$ is the residual matrix. Based on this definition, the parameters of the model are $\{U_{i},V_{i}\}_{i=1}^{2}$ and the parameters of the distribution of $\Xi_{i}$.

Let $D_{i}={\{T_{i,n}\}}_{n=1}^{N}$, where $i\in\{1,2\}$, be a set containing $N$ observed data matrices and $\{Z_{n}\}_{n=1}^{N}$ be the corresponding set of latent variables. Then the complete data are $(T_{1,n},T_{2,n},Z_{n})$ and the log likelihood of the complete data can be written as

$$L_{c}=\sum_{n=1}^{N}\log p(T_{1,n},T_{2,n},Z_{n})$$

To estimate the parameters, we must first calculate the expectation of the log-likelihood and then take the derivative of the expected log-likelihood with respect to each parameter. Unfortunately, there is no closed-form solution for computing the projection matrices $\{U_{i},V_{i}\}_{i=1}^{2}$ simultaneously. Inspired by tao2008bayesian , a decoupled probabilistic model is employed to obtain the projection matrices separately using an alternating optimization procedure. In such a model, we first assume that the value of one set of projection matrices, e.g. the right projections $\{V_{i}\}_{i=1}^{2}$, is known. Then the observations are projected onto the corresponding latent spaces. The projection procedure is a probabilistic one that is introduced in Section 3.3. By doing this, the left probabilistic model is defined as

$$T_{i}^{l}=U_{i}Z^{l}+\Xi_{i}^{l},\quad i=1,2$$

where $Z^{l}$ is the latent matrix of the left model, $\mu_{i}^{l}$ is the mean matrix of the left projected observations and $\Xi_{i}^{l}$ is the noise source of the left probabilistic model, whose columns follow a normal distribution with zero mean and covariance matrix $\Psi_{i}^{l}$. With this definition, the parameter set of the left probabilistic model is $\Theta^{l}=\{U_{i},\Psi_{i}^{l}\}_{i=1}^{2}$, which can be estimated using expectation maximization. The estimation procedure is explained later in this section.

Similarly, and in parallel to the left probabilistic model, for the right probabilistic model we assume that the left projection matrices, i.e. $\{U_{i}\}_{i=1}^{2}$, are known. Then the observations are projected onto the corresponding latent spaces; hence the right probabilistic model is defined as

$$T_{i}^{r}=V_{i}Z^{r}+\Xi_{i}^{r},\quad i=1,2$$

As in the left probabilistic model, $Z^{r}$, $\mu_{i}^{r}$ and $\Xi_{i}^{r}$ are defined for the right model, where the noise source has an $N(0,\Psi_{i}^{r})$ distribution. The parameter set of the right probabilistic model is $\Theta^{r}=\{V_{i},\Psi_{i}^{r}\}_{i=1}^{2}$. Using these definitions, the decoupled predictive density $p(T_{1},T_{2},Z)$ can be defined as

$$p(T_{1},T_{2},Z)\propto p(T_{1}^{l},T_{2}^{l},Z^{l})\,p(T_{1}^{r},T_{2}^{r},Z^{r})$$

Using the decoupled density of equation (15), the complete-data log likelihood can be rewritten as

$$L_{c}=\sum_{n=1}^{N}\log\left(p(T_{1,n}^{l},T_{2,n}^{l},Z_{n}^{l})\,p(T_{1,n}^{r},T_{2,n}^{r},Z_{n}^{r})\right)=\sum_{n=1}^{N}\log p(T_{1,n}^{l},T_{2,n}^{l},Z_{n}^{l})+\sum_{n=1}^{N}\log p(T_{1,n}^{r},T_{2,n}^{r},Z_{n}^{r})$$

To apply the EM algorithm to the decoupled probabilistic model, the E-step computes the expectation of the log likelihood separately for the left and the right probabilistic model. Then each expected log likelihood is maximized with respect to its own parameters. In the following subsections we describe how to optimize the left and right probabilistic models, respectively.

3.1 Optimizing the left probabilistic model

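Let $t_{i,j}^{l}$ be the $j^{th}$ column vector of $T_{i}^{l}\in\mathbb{R}^{m_{i}\times n'}$. Assuming the columns of $T_{i}^{l}$ to be independent of each other, the distribution of $T_{i}^{l}$ is defined as

$$p(T_{i}^{l})=\prod_{j=1}^{n'}p(t_{i,j}^{l}),\quad i\in\{1,2\}$$

We also consider $z_{j}^{l}\in\mathbb{R}^{m'\times 1}$ as the $j^{th}$ column vector of $Z^{l}$, which has the normal distribution $N(0,I)$, and in the same way $\mu_{i,j}^{l}$ is the $j^{th}$ column vector of $\mu_{i}^{l}$. Based on equation (13) and the distributions considered for $Z^{l}$ and $\Xi_{i}^{l}$, it can be concluded that

$$p(t_{i,j}^{l})\sim N(\mu_{i,j}^{l},U_{i}U_{i}^{T}+\Psi_{i}^{l}),\quad i=1,2$$

Suppose $\tau_{n,j}^{l}=[(t_{1,n,j}^{l})^{T}\;(t_{2,n,j}^{l})^{T}]^{T}\in\mathbb{R}^{(m_{1}+m_{2})\times 1}$, $U=[U_{1}^{T}\;U_{2}^{T}]^{T}\in\mathbb{R}^{(m_{1}+m_{2})\times m'}$, $m_{j}^{l}=[(\mu_{1,j}^{l})^{T}\;(\mu_{2,j}^{l})^{T}]^{T}\in\mathbb{R}^{(m_{1}+m_{2})\times 1}$ and $\Psi^{l}=\left(\begin{array}[]{cc}\Psi_{1}^{l}&0\\ 0&\Psi_{2}^{l}\end{array}\right)$ for the left probabilistic model, where $t_{i,n,j}^{l}$ refers to the $j^{th}$ column vector of the $n^{th}$ image in the $i^{th}$ observation set, $i\in\{1,2\}$. Therefore, the distribution $p(\tau_{j}^{l})$ is obtained as follows:

$$p(\tau_{j}^{l})\sim N(m_{j}^{l},\Sigma^{l}),\quad j\in[1,n']$$

where $\Sigma^{l}=UU^{T}+\Psi^{l}$ and we assume that $\Sigma^{l}$ is positive definite ($\Sigma^{l}>0$).

Based on (17) we can write:

$$p(T_{1,n}^{l},T_{2,n}^{l},Z_{n}^{l})=\prod_{j=1}^{n'}p(\tau_{n,j}^{l},z_{n,j}^{l})=\prod_{j=1}^{n'}p(\tau_{n,j}^{l}\mid z_{n,j}^{l})\,p(z_{n,j}^{l})$$

To apply the EM algorithm to the decoupled probabilistic model, the expectation of the log likelihood function, $E(L_{c}^{l})$, is calculated in the E-step (the details are given in the appendix), and the maximization step (M-step) then maximizes $E(L_{c}^{l})$ with respect to $U$ and $\Psi^{l}$. By doing so, the values of the parameters are estimated as

$$U_{t+1}=\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}\left[(M_{t}^{l})^{-1}+(M_{t}^{l})^{-1}U_{t}^{T}(\Psi_{t}^{l})^{-1}\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}\right]^{-1}$$

$$\Psi_{t+1}^{l}=\begin{pmatrix}\left(\widetilde{\Sigma}^{l}-\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}U_{t+1}^{T}\right)_{11}&0\\ 0&\left(\widetilde{\Sigma}^{l}-\widetilde{\Sigma}^{l}(\Psi_{t}^{l})^{-1}U_{t}(M_{t}^{l})^{-1}U_{t+1}^{T}\right)_{22}\end{pmatrix}$$

where $M^{l}=I+U^{T}(\Psi^{l})^{-1}U$, $A_{t}$ denotes the value of parameter $A$ in iteration $t$, and $\widetilde{\Sigma}^{l}$ is the sample covariance matrix of the observed data for the left probabilistic model, i.e.

$$\widetilde{\Sigma}^{l}=\frac{1}{N}\sum_{n=1}^{N}\left[(T_{1,n}^{l}-\mu_{1}^{l})^{T}\;(T_{2,n}^{l}-\mu_{2}^{l})^{T}\right]^{T}\left[(T_{1,n}^{l}-\mu_{1}^{l})^{T}\;(T_{2,n}^{l}-\mu_{2}^{l})^{T}\right]$$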
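To make the M-step concrete, here is a minimal NumPy sketch of one EM update for the left model, written in terms of the stacked loading matrix $U$, the block-diagonal noise covariance $\Psi^{l}$ and the sample covariance $\widetilde{\Sigma}^{l}$. It assumes the update has the standard factor-analysis form with the bracketed term inverted on the right; the function names `em_step` and `avg_loglik`, the dimensions and the synthetic data are hypothetical choices for this example.

```python
import numpy as np

def em_step(S, U, Psi, m1):
    """One EM update of (U, Psi) given the stacked sample covariance S."""
    d = U.shape[1]
    Psi_inv = np.linalg.inv(Psi)
    M_inv = np.linalg.inv(np.eye(d) + U.T @ Psi_inv @ U)  # (M_t)^{-1}
    B = S @ Psi_inv @ U @ M_inv                           # S Psi^{-1} U M^{-1}
    U_new = B @ np.linalg.inv(M_inv + M_inv @ U.T @ Psi_inv @ B)
    R = S - B @ U_new.T
    # Keep only the two diagonal blocks, one per view.
    Psi_new = np.zeros_like(Psi)
    Psi_new[:m1, :m1] = R[:m1, :m1]
    Psi_new[m1:, m1:] = R[m1:, m1:]
    return U_new, Psi_new

def avg_loglik(S, U, Psi):
    """Per-sample Gaussian log likelihood, up to an additive constant."""
    C = U @ U.T + Psi
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + np.trace(np.linalg.solve(C, S)))

# Synthetic stacked columns tau = U z + noise (hypothetical sizes).
rng = np.random.default_rng(0)
m1, m2, d, n = 4, 3, 2, 2000
U_true = rng.normal(size=(m1 + m2, d))
tau = rng.normal(size=(n, d)) @ U_true.T + 0.3 * rng.normal(size=(n, m1 + m2))
tau = tau - tau.mean(axis=0)
S = tau.T @ tau / n

U, Psi = rng.normal(size=(m1 + m2, d)), np.eye(m1 + m2)
logliks = []
for _ in range(50):
    logliks.append(avg_loglik(S, U, Psi))
    U, Psi = em_step(S, U, Psi, m1)
```

Tracking `logliks` is a simple sanity check: each EM iteration should not decrease the likelihood.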

3.2 Optimizing the right probabilistic model

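In a manner similar to the optimization of the left probabilistic model, we have:

$$p(T_{i}^{r})=\prod_{j=1}^{m'}p(t_{i,j}^{r}),\quad i\in\{1,2\}$$

where $t_{i,j}^{r}$ is the $j^{th}$ column vector of $T_{i}^{r}\in\mathbb{R}^{n_{i}\times m'}$. Then $p(t_{i,j}^{r})$ is computed as:

$$p(t_{i,j}^{r})\sim N(\mu_{i,j}^{r},V_{i}V_{i}^{T}+\Psi_{i}^{r}),\quad i\in\{1,2\}$$

Let $\tau_{n,j}^{r}=[(t_{1,n,j}^{r})^{T}\;(t_{2,n,j}^{r})^{T}]^{T}\in\mathbb{R}^{(n_{1}+n_{2})\times 1}$, $V=[V_{1}^{T}\;V_{2}^{T}]^{T}\in\mathbb{R}^{(n_{1}+n_{2})\times n'}$, $m_{j}^{r}=[(\mu_{1,j}^{r})^{T}\;(\mu_{2,j}^{r})^{T}]^{T}\in\mathbb{R}^{(n_{1}+n_{2})\times 1}$ and $\Psi^{r}=\left(\begin{array}[]{cc}\Psi_{1}^{r}&0\\ 0&\Psi_{2}^{r}\end{array}\right)$, where $t_{i,n,j}^{r}$ refers to the $j^{th}$ column vector of the $n^{th}$ image in the $i^{th}$ observation set. Then $p(\tau_{j}^{r})$ and $p(T_{1,n}^{r},T_{2,n}^{r},Z_{n}^{r})$ are obtained as follows:

$$p(\tau_{j}^{r})\sim N(m_{j}^{r},\Sigma^{r}),\quad j\in[1,m']$$

$$p(T_{1,n}^{r},T_{2,n}^{r},Z_{n}^{r})=\prod_{j=1}^{m'}p(\tau_{n,j}^{r},z_{n,j}^{r})=\prod_{j=1}^{m'}p(\tau_{n,j}^{r}\mid z_{n,j}^{r})\,p(z_{n,j}^{r})$$

where $\Sigma^{r}=VV^{T}+\Psi^{r}>0$. Following the details in Appendix A, after computing $E(L_{c}^{r})$ in the E-step, the parameters $V$ and $\Psi^{r}$ are computed by maximizing the expected likelihood in the M-step. So, we have:

$$V_{t+1}=\widetilde{\Sigma}^{r}(\Psi_{t}^{r})^{-1}V_{t}(M_{t}^{r})^{-1}\left[(M_{t}^{r})^{-1}+(M_{t}^{r})^{-1}V_{t}^{T}(\Psi_{t}^{r})^{-1}\widetilde{\Sigma}^{r}(\Psi_{t}^{r})^{-1}V_{t}(M_{t}^{r})^{-1}\right]^{-1}$$

$$\Psi_{t+1}^{r}=\begin{pmatrix}\left(\widetilde{\Sigma}^{r}-\widetilde{\Sigma}^{r}(\Psi_{t}^{r})^{-1}V_{t}(M_{t}^{r})^{-1}V_{t+1}^{T}\right)_{11}&0\\ 0&\left(\widetilde{\Sigma}^{r}-\widetilde{\Sigma}^{r}(\Psi_{t}^{r})^{-1}V_{t}(M_{t}^{r})^{-1}V_{t+1}^{T}\right)_{22}\end{pmatrix}$$

where $M^{r}=I+V^{T}(\Psi^{r})^{-1}V$ and $\widetilde{\Sigma}^{r}$ is computed as follows:

$$\widetilde{\Sigma}^{r}=\frac{1}{N}\sum_{n=1}^{N}\left[(T_{1,n}^{r}-\mu_{1}^{r})^{T}\;(T_{2,n}^{r}-\mu_{2}^{r})^{T}\right]^{T}\left[(T_{1,n}^{r}-\mu_{1}^{r})^{T}\;(T_{2,n}^{r}-\mu_{2}^{r})^{T}\right]$$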

3.3 Probabilistic projection and dimension reduction

We can project the observation matrices into the latent space using the standard projection matrices, i.e., $\{U_{1},U_{2},V_{1},V_{2}\}$. However, as described in tipping1999probabilistic , it is more natural to use probabilistic projections. In this regard, we represent each projected observation matrix, $T_{i}$, by the mean of the distribution of the corresponding latent variable, i.e., $E(Z|T_{i})$. For the left model, it can be shown that

[TABLE]

$E(Z^{l}|T_{1})$ and $E(Z^{l}|T_{2})$ are obtained by marginalizing (40) over $T_{2}$ and $T_{1}$ respectively. So we have

[TABLE]

Similarly for the right model we have

[TABLE]

The procedure for dimension reduction is given by sequential projection in left and right models as

[TABLE]

The P2DCCA algorithm is summarized in Figure 2. The proposed P2DCCA model can be extended to other methods such as mixtures of P2DCCA, Bayesian P2DCCA and robust P2DCCA.
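The idea of representing an observation by the posterior mean of its latent variable can be sketched for a single view as follows. Since the projection equations are not reproduced above, this example uses the standard Gaussian conditioning result for a model $t=Uz+\mu+\xi$ with $z\sim N(0,I)$ and $\xi\sim N(0,\Psi)$, namely $E(z|t)=(I+U^{T}\Psi^{-1}U)^{-1}U^{T}\Psi^{-1}(t-\mu)$; it is an assumption that this matches the form used in the paper, and the name `posterior_mean` and the toy values are hypothetical.

```python
import numpy as np

def posterior_mean(t, mu, U, Psi):
    """E(z | t) for t = U z + mu + noise, z ~ N(0, I), noise ~ N(0, Psi).

    t and mu have shape (m, k): each column is projected independently.
    """
    Psi_inv_U = np.linalg.solve(Psi, U)         # Psi^{-1} U
    M = np.eye(U.shape[1]) + U.T @ Psi_inv_U    # I + U^T Psi^{-1} U
    return np.linalg.solve(M, Psi_inv_U.T @ (t - mu))

# Toy example: a 6-row observation with 4 columns mapped to a 2-row latent.
rng = np.random.default_rng(0)
U = rng.normal(size=(6, 2))
A = rng.normal(size=(6, 6))
Psi = A @ A.T + np.eye(6)   # an arbitrary positive definite noise covariance
T = rng.normal(size=(6, 4))
mu = np.zeros((6, 4))
Z_mean = posterior_mean(T, mu, U, Psi)
```

By the matrix inversion lemma, the same result is obtained from $U^{T}(UU^{T}+\Psi)^{-1}(t-\mu)$; the form used in the sketch only inverts a matrix of the latent dimension.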

4 Experimental Results

We evaluated our algorithm on both synthetic and real data. In the synthetic part, we verified our implementation of the P2DCCA algorithm using simple synthetic data and randomly generated projection matrices. We also compared our algorithm with the 2DCCA method in terms of projection matrix estimation. In the real-data part, the proposed P2DCCA method was used for face recognition on two well-known face image databases (AR martinez1998ar and UMIST graham1998characterising ). The AR database is divided into three subsets for evaluating the performance of the system under different illumination, expression and occlusion conditions. The UMIST database is used to assess performance under pose variation.

4.1 Experiments on synthetic data

In this section, we verify our implementation of the proposed method in the simplest possible scenario. We generate synthetic data and projection matrices, estimate the projection matrices using our method, and compare them with the true ones. Since P2DCCA estimates are determined only up to rotation, we simplify the comparison by taking $Z$ to be of dimension $1\times 1$. We set the dimensions of $T_{1}$ and $T_{2}$ to $5\times 5$. We then generate 1000 samples of $Z$ from a normal distribution with zero mean and unit variance, and randomly generate the elements of $U_{i}\in\mathbb{R}^{5}$, $V_{i}\in\mathbb{R}^{5}$, $i\in\{1,2\}$, from a uniform distribution on the interval [0, 1]; these are taken as the ground truth projection matrices. $T_{i}$ is then obtained using (11), where each element of the residual matrices is sampled from a Gaussian distribution with zero mean and variance $\sigma_{i}^{2}$. Given the synthetic data, we run the P2DCCA algorithm summarized in Figure 2 and calculate $U_{i}$ and $V_{i}$. We also run 2DCCA and obtain the corresponding projection matrices. We then compare the matrices obtained by these two algorithms with the ground truth projection matrices. To cancel the scale factors, we divide each transform by its norm before comparison. The Euclidean distance is used to compare the normalized transforms. Figure 3 shows the results. In the worst case the distance value is one and in the best case it is zero. As depicted in this figure, the P2DCCA estimates of $U_{i}$ and $V_{i}$ for $i\in\{1,2\}$ are much closer to the ground truth than those obtained by 2DCCA. In this experiment we used 1000 generated samples and set $\sigma_{i}=0.1$. To examine the effect of these choices, we repeated the experiment with different numbers of samples and different noise variances. Figure 4 shows the results. As can be observed from this figure, the P2DCCA estimate is in all cases much closer to the ground truth than the 2DCCA estimate.
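The data generation and the comparison metric described above can be sketched as follows. This is only an illustration under stated assumptions: the mean matrices $\mu_{i}$ are taken to be zero, the same $\sigma$ is used for both views, and the distance additionally resolves the sign ambiguity of an estimated direction, which the text does not specify. The names `make_synthetic` and `normalized_distance` are hypothetical.

```python
import numpy as np

def make_synthetic(n_samples=1000, size=5, sigma=0.1, seed=0):
    """Draw T_i = U_i Z V_i^T + noise with a 1x1 latent Z (zero means assumed)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_samples)            # Z ~ N(0, 1)
    U = rng.uniform(0, 1, size=(2, size))     # ground-truth left transforms
    V = rng.uniform(0, 1, size=(2, size))     # ground-truth right transforms
    T = []
    for i in range(2):
        signal = z[:, None, None] * np.outer(U[i], V[i])
        noise = sigma * rng.normal(size=(n_samples, size, size))
        T.append(signal + noise)
    return T, U, V

def normalized_distance(est, truth):
    """Euclidean distance between unit-normalized transforms.

    Taking the minimum over both signs is an assumption of this sketch.
    """
    e = est / np.linalg.norm(est)
    t = truth / np.linalg.norm(truth)
    return min(np.linalg.norm(e - t), np.linalg.norm(e + t))

T, U, V = make_synthetic()
```

An estimate that differs from the ground truth only by a scale factor, such as `2.0 * U[0]`, has distance zero under this measure.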

4.2 Experiments on the AR database

The AR face database contains over 4,000 color face images, including frontal views of faces with different facial expressions, illumination conditions and occlusions. For most individuals, there are two sessions of images taken in two different time periods. Each session contains 13 images. In our experiments, we used the first session, because some individuals do not have a second session. We collected 1310 face images of 131 people (72 male and 59 female). For each person, there are 10 different face images in our collection: three with different illumination conditions, three with different expressions, three with occlusions, and the remaining image with neutral expression and no occlusion, which is used as the reference image in our experiments. To examine the performance of the proposed method in different conditions, we partitioned the collected images into three subsets, AR-1, AR-2 and AR-3. For each individual, AR-1 contains four images: three with different lighting conditions and the reference image. AR-2 is used to test the performance of the algorithms under expression variation. It contains four images per individual: three with different expressions and the reference image. AR-3 is prepared to test performance in the presence of occlusion. Again, this subset contains four images per individual, of which three were taken with glasses and the last one is the reference image. Figure 5 shows example face images of a man and a woman in AR-1, AR-2 and AR-3, respectively. Images are converted to grayscale, resized and then normalized to $50\times 50$ pixels.

We compared the performance of the proposed method with a range of supervised and unsupervised dimensionality reduction algorithms and their variants, including PCA, LDA, CCA, PPCA, PCCA, 2DPCA, 2DLDA and 2DCCA. Both PCA- and LDA-based methods work with one set of data. LDA-based methods are supervised, while PCA- and CCA-based algorithms are unsupervised. The task here is to investigate how well the different algorithms can relate face images with varying illumination conditions, expressions and occlusion to the reference face images. CCA, PCA, LDA, PPCA, PCCA, 2DCCA, 2DPCA, 2DLDA and P2DCCA are used to extract features from the facial images, and a 1-NN classifier is then employed for classification. Depending on the output type of each algorithm (vector or matrix), the Frobenius distance is used to calculate the distance between two feature matrices for 2DCCA, 2DPCA, 2DLDA and P2DCCA, while the common Euclidean distance is adopted for CCA, PCA, LDA, PPCA and PCCA. Furthermore, since PCCA suffers from the small sample size problem, implementing it using the formulas introduced in bach2005probabilistic caused the covariance matrices to be singular. To solve this problem, we reduced the dimension using PCA before applying the algorithm.

To evaluate the recognition accuracy, we used three-fold cross-validation. CCA based algorithms need two sets of images for training, which in this paper are called the left training set and the right training set. For AR-1, for example, the neutral images (images with no illumination variation) form the left training set, while the right training set is formed by randomly selecting, for each individual, one of the three images with different illumination conditions. The other two images are used as test images. This procedure is repeated three times, each time with a different one of the three images selected for the right training set, while the neutral images always form the left training set. To test the performance of each algorithm, all images of the left and right training sets are projected onto the new feature spaces using their corresponding transforms. Each test image is also projected onto both feature spaces, so there are two projections for every test image. We then calculate the distances between each of the two projected test images and the projected training images. The label of the training image whose projection is nearest to either of the two test image projections determines the final class of the test image. This procedure is repeated until a final class has been assigned to every image in the test set. The assigned classes are compared to the true classes of the images, and the recognition accuracy of each algorithm is calculated. Finally, the average recognition rate over the three rounds is recorded as the final recognition accuracy.
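The classification rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the feature matrices are assumed to have been produced already by the left and right transforms, and it assumes that each test projection is compared only with training projections from the same feature space, which the text does not state explicitly.

```python
import math


def frobenius_distance(a, b):
    """Frobenius distance between two equally sized matrices (lists of rows)."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def classify_two_view(test_left, test_right, train_left, train_right, labels):
    """1-NN over both feature spaces (hypothetical helper).

    test_left / test_right: the test image projected with the left / right transform.
    train_left / train_right: projected left / right training images.
    labels: class label of each training pair, in the same order as both lists.
    Assumption: distances are only computed within the same feature space.
    """
    best_label, best_dist = None, math.inf
    for feats, query in ((train_left, test_left), (train_right, test_right)):
        for feat, label in zip(feats, labels):
            dist = frobenius_distance(query, feat)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label
```

For the vector-valued methods (CCA, PCA, LDA, PPCA and PCCA), the same rule applies with the Euclidean distance in place of the Frobenius distance.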

Since PCA and LDA based algorithms work with one set of data, to have a fair comparison we used two images per individual as training data and the other two as test data, where the training data always consists of the neutral image together with one of the images with different illumination in each iteration. Again, the process is repeated three times and the final accuracy is the average of the three runs. Figure 7 shows the test process for P2DCCA.

The training and test procedures for the AR-2 and AR-3 subsets are the same as for AR-1. Table 1 through Table 3 report the recognition accuracy of the evaluated algorithms for the experiments conducted on AR-1, AR-2 and AR-3, respectively. In these tables, $d$ is the dimension of the reduced feature space. Note that the output of the two-dimensional algorithms (2DCCA, 2DPCA, 2DLDA, P2DCCA and MP2DCCA) is a matrix of dimension $d\times d$, while for CCA, PCA, LDA, PPCA and PCCA the output is a vector of dimension $d$. In these results, P2DCCA achieves the best performance among all tested methods, with about 10% improvement in recognition rate over 2DCCA on AR-1 and AR-3, and about 3% improvement on AR-2.

Figure 8 shows how the log-likelihoods of the left and right probabilistic models of P2DCCA improve with each iteration. As can be seen in the figure, both the left and the right model converge.

It is common for algorithms that alternately optimize row and column projections to perform only one iteration of the alternating procedure [confnipsYeJL04]. Therefore, in all experiments we use one iteration of the algorithm, i.e., $T_{max}=1$, which significantly reduces its computational cost. We also tried more iterations and obtained no significant improvement in the recognition rate. Figure 9 shows the results for 1 to 5 iterations on AR-1, AR-2 and AR-3. These results support the choice of $T_{max}=1$.

4.3 Experiments on the UMIST database

The UMIST face database, also known as the Sheffield face database [UMIST], consists of 564 images of 20 subjects of different races, sexes and appearances. For each subject, the images cover a range of poses from profile to frontal view. The images have 256 grey levels at a resolution of $220\times 220$ pixels. In our experiment, 360 images (18 samples per subject) are used to examine the performance of the different algorithms when face orientation varies significantly. Figure 10 shows the 18 images of one subject.

For each subject, we select the frontal image and seven other randomly selected images for the training set and use the remaining images as the test set. In the training phase of the CCA based methods, the frontal image is always selected as the left training image and one of the seven other images as the right training image. A 1-NN classifier is used for classification. This procedure is repeated twenty times, and the average recognition rates of the algorithms are reported. Table 4 shows the recognition accuracy of the evaluated algorithms for the experiments conducted on UMIST. In this test, P2DCCA achieves slightly better performance than 2DCCA, but it is not the best method; LDA achieves the best performance. In the AR tests, LDA had 2 images per class for training, whereas here it has 8 images per class, which led to the best performance for this supervised method. Apart from LDA, P2DCCA performs better than the other methods.

Since the UMIST face dataset contains 20 subjects to be discriminated, the number of LDA features is limited to 19. To be able to compare the results of LDA with those of the other algorithms, we report the results for $d=5$, $d=10$ and $d=15$ for LDA and for larger values of $d$ for the other algorithms.
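The limit of 19 features follows from the rank of the between-class scatter matrix used by LDA. As a brief reminder, in notation introduced here only for illustration ($C$ classes, $n_c$ samples in class $c$, class means $\mu_c$ and global mean $\mu$):

```latex
S_b = \sum_{c=1}^{C} n_c\,(\mu_c-\mu)(\mu_c-\mu)^{T},
\qquad
\sum_{c=1}^{C} n_c\,(\mu_c-\mu) = 0
\;\;\Longrightarrow\;\;
\operatorname{rank}(S_b) \le C-1 = 19 \quad \text{for } C = 20 .
```

The $C$ vectors $\mu_c-\mu$ are linearly dependent, so at most $C-1$ discriminant directions have a nonzero between-class scatter.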

4.4 Evaluation of the Experimental Results

The above experiments showed that the accuracy of P2DCCA is consistently better than that of the other CCA based methods, i.e., CCA, PCCA and 2DCCA. However, a question still remains: are these differences statistically significant? In this section, we answer this question by evaluating the experimental results with the independent-samples t-test (independent t-test, for short). In this section and the next, we consider only the CCA based algorithms, i.e., CCA, PCCA, 2DCCA and P2DCCA, since the goal of this paper is to compare the newly proposed CCA based method with the other CCA based algorithms. The significance level is $0.05$, and the null hypothesis is that there is no significant difference between the recognition rates of P2DCCA and those of CCA, PCCA and 2DCCA, respectively. We reject the null hypothesis whenever the resulting $p$-value is lower than $0.05$, in which case the result is considered statistically significant. Note that, to run the t-test on each dataset, we considered the highest recognition rate of each algorithm. Table 5 shows the $p$-values of the tests. As can be seen from this table, P2DCCA significantly outperforms the other algorithms, and the null hypothesis is rejected in all cases.
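The test above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the procedure used for Table 5: the text does not say whether equal variances were assumed, so the sketch uses the pooled-variance (Student) form, and the function names and the accuracy values in the usage lines are hypothetical placeholders rather than results from the paper.

```python
import math
from statistics import mean, variance


def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    log_coef = math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
    coef = math.exp(log_coef) / math.sqrt(dof * math.pi)
    return coef * (1 + x * x / dof) ** (-(dof + 1) / 2)


def two_sided_p(t, dof, steps=2000):
    """Two-sided p-value, integrating the density on [0, |t|] with Simpson's rule."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(b, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)


def independent_t_test(a, b):
    """Pooled-variance t-test for two independent samples (assumed variant)."""
    na, nb = len(a), len(b)
    dof = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, dof)


# Hypothetical per-round accuracies for two methods (not taken from the paper).
t_stat, p_value = independent_t_test([0.90, 0.92, 0.94], [0.80, 0.82, 0.84])
reject_null = p_value < 0.05
```

With only a few runs per method, the degrees of freedom are small, so a difference has to be large relative to the run-to-run variance before the null hypothesis is rejected.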

4.5 Computational Complexity

This section compares the computational cost of the algorithms. To compare their time complexity, we consider input images of size $m\times m$, which the two-dimensional algorithms reduce to dimension $d\times d$. For CCA and PCCA, vectorization causes the input data to have dimension $m^{2}\times 1$ and the output to have dimension $d\times 1$. For simplicity, however, $d$ is taken to be equal to $m$ in our analysis, i.e., $d=m$. Table 6 shows the computational complexity of the algorithms. In this table, $N$ is the number of samples in the dataset. Note that there are two types of iteration in the corresponding methods: the iterations needed for the convergence of the EM part of the algorithm, and the iterations that alternate the optimization procedure between the left and right models. We denote the number of the former by $t$ and the number of the latter by $r$ in the table. In our experiments, $r=1$.
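To give a sense of scale for the vectorized versus matrix-based setting, the following arithmetic uses the $50\times 50$ images from the AR experiments. It only counts covariance matrix entries and is not a substitute for the complexities in Table 6; it assumes a two-dimensional method that works with row-wise or column-wise covariance matrices of size $m\times m$.

```latex
m = 50:\qquad
\underbrace{m^{2}\times m^{2} = 2500\times 2500 = 6.25\times 10^{6}}_{\text{entries, vectorized input}}
\qquad\text{versus}\qquad
\underbrace{m\times m = 50\times 50 = 2500}_{\text{entries, matrix input}}
```

The ratio between the two counts is $m^{2}=2500$, which is one reason why vectorized methods are also more exposed to the small sample size problem.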

5 Conclusion

This paper proposed a probabilistic model for two-dimensional CCA, termed P2DCCA, together with an EM-based solution for estimating the parameters of the model. Experimental results demonstrated the effectiveness of the proposed method. P2DCCA has several advantages over 2DCCA, the most significant of which is that it can be extended to a mixture of P2DCCA models. It may also be possible to develop a Bayesian model for P2DCCA and gain the benefits of a Bayesian treatment. We leave these extensions to future work.

Appendix A

As mentioned in Section 3, each column of the latent matrix $Z$ has the distribution $N(0,I)$. Furthermore, based on (13) and (14) we can write:

[TABLE]

Now, from (20) and (29) and the distribution of the columns of $Z$, we have

[TABLE]

where $|A|$ denotes the determinant of matrix $A$ .

In the E-step, expectation of log likelihood for each of the probabilistic models is calculated as:

[TABLE]

Once the formulas for $E(L_{c}^{l})$ and $E(L_{c}^{r})$ are obtained, the M-step is carried out by differentiating each expected log-likelihood with respect to the model parameters and setting the derivatives to zero.

Bibliography


(1) I. Jolliffe, Principal component analysis, Wiley Online Library, 2002.
(2) H. Hotelling, Relations between two sets of variates, Biometrika (1936) 321–377.
(3) C.-C. Jia, S.-J. Wang, X.-J. Peng, W. Pang, C.-Y. Zhang, C.-G. Zhou, Z.-Z. Yu, Incremental multi-linear discriminant analysis using canonical correlations for action recognition, Neurocomputing 83 (2012) 56–63.
(4) Y. R. Wang, K. Jiang, L. J. Feldman, P. J. Bickel, H. Huang, Inferring gene association networks using sparse canonical correlation analysis, arXiv preprint arXiv:1401.6504.
(5) S. Huang, J. Chen, Z. Luo, Sparse tensor CCA for color face recognition, Neural Computing and Applications 24 (7-8) (2014) 1647–1658.
(6) J. Ye, R. Janardan, Q. Li, GPCA: an efficient dimension reduction scheme for image compression and retrieval, in: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2004, pp. 354–363.
(7) D. Zhang, Z.-H. Zhou, (2D)2PCA: Two-directional two-dimensional PCA for efficient face representation and recognition, Neurocomputing 69 (1) (2005) 224–231.
(8) S. H. Lee, S. Choi, Two-dimensional canonical correlation analysis, Signal Processing Letters, IEEE 14 (10) (2007) 735–738.