Face Recognition using Multi-Modal Low-Rank Dictionary Learning
Homa Foroughi, Moein Shakeri, Nilanjan Ray, Hong Zhang

TL;DR
This paper introduces a multi-modal low-rank dictionary learning approach for face recognition that effectively handles occlusion and illumination variations by fusing raw pixel data with illumination-invariant features.
Contribution
It presents a novel multi-modal structured low-rank dictionary learning method that improves robustness and discriminability in face recognition under challenging conditions.
Findings
Outperforms existing methods on various datasets.
Robust to severe illumination changes and occlusion.
Effective with limited training samples.
Table 1: Recognition rates (%) on the AR dataset.
| Method | Sunglasses | Scarf | Mixed | Misc. |
|---|---|---|---|---|
| MLDL [9] | 90.51 | 91.51 | 91.32 | 76.33 |
| UMD2L [8] | 88.26 | 87.40 | 88.30 | 71.30 |
| MSDL [15] | 83.20 | 80.65 | 79.50 | 68.44 |
| D2L2R2 [4] | 92.20 | 90.40 | 91.30 | 75.30 |
| SLRDL [5] | 87.35 | 83.40 | 82.47 | 72.30 |
| JP-LRDL [3] | 93.20 | 93.00 | 93.30 | 78.23 |
| AlexNet [16] | 30.33 | 30.12 | 30.17 | 25.55 |
| VGG-Face [17] | 85.90 | 85.01 | 87.30 | 79.83 |
| MM-SLDL | 96.70 | 96.41 | 96.30 | 85.30 |
Abstract
Face recognition has been widely studied due to its importance in different applications; however, most of the proposed methods fail when face images are occluded or captured under illumination and pose variations. Recently several low-rank dictionary learning methods have been proposed and achieved promising results for noisy observations. While these methods are mostly developed for single-modality scenarios, recent studies demonstrated the advantages of feature fusion from multiple inputs. We propose a multi-modal structured low-rank dictionary learning method for robust face recognition, using raw pixels of face images and their illumination invariant representation. The proposed method learns robust and discriminative representations from contaminated face images, even if there are few training samples with large intra-class variations. Extensive experiments on different datasets validate the superior performance and robustness of our method to severe illumination variations and occlusion.
**Index Terms:** Multi-modal dictionary learning, Low-rank learning, Illumination invariant, Face recognition
1 INTRODUCTION
The last decade has witnessed tremendous progress in face recognition technologies, and high recognition performance has been reported by different methods under ideal conditions; however, most of these methods are not robust to outliers, occlusions, or severe illumination and pose variations. In recent years, dictionary learning (DL) algorithms have been successfully applied to different vision tasks, including face recognition. DL is a feature learning technique in which an input signal is represented as a sparse linear combination of dictionary atoms. To alleviate the effects of the aforementioned variations, low-rank (LR) matrix recovery has been integrated into the DL framework and has been shown to achieve promising results when the data are corrupted. LR matrix recovery [1] was originally proposed to recover an LR matrix from corrupted observations, and has been successfully applied to applications such as background modeling [2] and image classification [3]. Li et al. [4] developed a discriminative DL method by combining the Fisher discrimination criterion with an LR constraint on sub-dictionaries. Zhang et al. [5] presented a structured, sparse and LR representation for image classification by adding a regularization term to the DL objective function. Recently, Foroughi et al. [3] proposed a joint projection and LR-DL method using dual graph constraints for classification of small datasets that include a considerable amount of variation.
In parallel developments, it is well established that information fusion from multiple sources can generally improve recognition performance, since it provides a framework for combining information from different perspectives that is more tolerant to the errors of individual sources [6]. To benefit from information fusion, some methods have also successfully incorporated DL into the feature learning framework. Monaci et al. [7] proposed a multi-modal DL algorithm to extract typical templates that represent synchronous transient structures between multi-modal features. Jing et al. [8] proposed an uncorrelated multi-view discrimination DL method based on the Fisher discrimination criterion that jointly learns multiple uncorrelated discriminative dictionaries from different views. Nevertheless, the only work that has integrated LR into multi-modal DL was presented by Wu et al. [9], who construct class-specific sub-dictionaries for each modality and impose LR and incoherence constraints on each view.
To construct different modalities, most existing methods either exploit multiple view angles [9], extract different local features [9], or use weak biometrics [10] from pre-defined regions of face images. These methods suffer from two main disadvantages that impose extra overhead on the system. First, they demand either several cameras or manual region definition and hand-crafted feature extraction. Second, they are not applicable to the millions of available face images that have already been captured from a single view. By exploiting more meaningful modalities, we address these challenges and further increase the recognition rate. Recently, Shakeri et al. [11] presented an illumination invariant representation of an image for outdoor place recognition. To create this representation, they use a Wiener filter, derived from the power law spectrum assumption of natural images, that is robust against illumination variations. Since the obtained representation may lose the chromaticity of the image, a shadow removal method based on entropy minimization is utilized. This representation showed superior performance for outdoor place recognition under various illumination and shadow variations. Inspired by this success, we design a framework for multi-modal fusion with the following contributions:
- We design a multi-modal LR-DL method, where in each modality a discriminative and reconstructive dictionary, and a structured sparse and LR representation are learned from face images, and the collaboration between modalities is encouraged by incorporating an ideal representation term. We provide a new classification schema, which utilizes the reconstruction by LR and sparse noise components.
- By adopting illumination invariant representation of images as one of the modalities, the model learns robust and discriminative representations from noisy images, even when the kind of variation is different in the training and test sets. The proposed method achieves superior performance for small datasets that have large intra-class variation.
2 THE PROPOSED MM-SLDL METHOD
We propose a Multi-Modal Structured Low-rank Dictionary Learning method (MM-SLDL) for face recognition, in which we use two modalities. The first modality is constructed from the raw pixels of face images, while the second is formed by illumination invariant images [11]. The training data of each modality are arranged as a matrix grouped by class, with one block of samples per class. In each modality, we use a supervised learning method to learn a discriminative and reconstructive dictionary, together with a structured sparse and LR image representation. LR matrix recovery decomposes the corrupted data matrix into an LR component and a sparse noise component. With respect to the dictionary, the optimal representation matrix of the training data should be block-diagonal [12]. In each modality, the dictionary contains one sub-dictionary per class, and the representation of each class with respect to the dictionary splits into blocks of coefficients, one for each sub-dictionary. To learn robust representations from images, the dictionary should have discriminative and reconstructive power. First, each sub-dictionary should represent the samples of its own class well, and ideally be exclusive to that subject. Second, every class needs to be well represented by its own sub-dictionary, and finally, the coefficients of a class for the sub-dictionaries of other classes should be nearly all zero. So, the objective function of MM-SLDL is defined as:
[TABLE]
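For orientation, the single-modality structured low-rank objective of [5], on which this formulation builds, can be written as in the first expression below. The symbols are assumptions chosen here for illustration ($X$ for the data, $D$ for the dictionary, $Z$ for the representation, $E$ for the sparse noise, $Q$ for the ideal representation described in the next paragraph, and $\lambda$, $\beta$, $\alpha$ for the trade-off weights). The second expression is only a plausible two-modality extension consistent with the description in this section, not a transcription of the paper's equation.

```latex
% Single-modality structured low-rank objective, in the form used by [5].
\min_{Z,\,E,\,D}\; \|Z\|_* + \lambda \|E\|_1 + \beta \|Z\|_1 + \alpha \|Z - Q\|_F^2
\quad \text{s.t.} \quad X = DZ + E

% Assumed two-modality extension: each modality m keeps its own D^m, Z^m, E^m,
% and both representations are pulled toward the same ideal code Q.
\min_{\{D^m, Z^m, E^m\}_{m=1}^{2}}\; \sum_{m=1}^{2}
\Big( \|Z^m\|_* + \lambda \|E^m\|_1 + \beta \|Z^m\|_1 + \alpha \|Z^m - Q\|_F^2 \Big)
\quad \text{s.t.} \quad X^m = D^m Z^m + E^m,\; m = 1, 2
```

Under this reading, the shared term $\|Z^m - Q\|_F^2$ is what couples the two modalities, since both representations are regularized toward the same block-diagonal target.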
The main objective function simultaneously trains the two dictionaries and representations under the joint ideal regularization prior. The ideal representation is built from the training data in block-diagonal form: its number of rows equals the size of the dictionary, and each of its columns is the code for one training sample. If a sample belongs to class L, then the entries of its code that correspond to the sub-dictionary of class L are all 1s, while the others are all 0s. We add this regularization term for two reasons: first, to include structure information in the dictionary learning process, and second, to enforce collaboration between the two modalities. It encourages training images of the same class to have the same representation in different modalities, despite intra-class variations.
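The block-diagonal ideal representation described above can be built directly from the class labels. The following is a minimal NumPy sketch; the function name `ideal_representation`, its arguments, and the convention that dictionary atoms are ordered class by class are assumptions made for illustration.

```python
import numpy as np

def ideal_representation(labels, atoms_per_class):
    """Build a 0/1 code matrix with one row per dictionary atom and one
    column per training sample (hypothetical helper, not from the paper).

    labels          : class label of every training sample
    atoms_per_class : number of dictionary atoms assigned to each class,
                      listed in the order of the sorted unique labels
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Class index that owns each dictionary atom (atoms ordered by class).
    atom_class = np.repeat(np.arange(len(classes)), atoms_per_class)
    # Class index of each training sample.
    sample_class = np.searchsorted(classes, labels)
    # Entry (k, j) is 1 when atom k and sample j belong to the same class.
    return (atom_class[:, None] == sample_class[None, :]).astype(float)

# Two classes with 2 and 3 samples; one atom per training sample.
labels = [0, 0, 1, 1, 1]
Q = ideal_representation(labels, atoms_per_class=[2, 3])
print(Q)
```

When the samples are sorted by class, as here, the matrix is visibly block-diagonal; for unsorted labels the same 0/1 pattern holds column by column.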
Classification Scheme: After the dictionaries are learned, the LR sparse representations of the training data and of the test data are calculated by solving (2) separately using Algorithm 1. The representation of each test sample is the corresponding column of the test representation matrix. Using the multivariate ridge regression model [13], we obtain a linear classifier as:
[TABLE]
where the class label matrix encodes the labels of the training data. This problem has a closed-form solution. The estimated label in each modality is obtained as:
[TABLE]
where the classifier output for a test sample is its class label vector. We then use LR matrix recovery to obtain the LR and sparse noise components of the potential classes, and then compute the reconstruction error of the given query sample in both modalities:
[TABLE]
where the reconstruction uses the LR component of each potential class in the given modality and the average sparse noise of that class. Since the data range differs between the two modalities, we use a normalization step, and the winning class is the one that minimizes the ratio of (4) between the two modalities.
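The ridge regression classifier used in the classification scheme above can be sketched as follows. The closed form $W = H Z^\top (Z Z^\top + \gamma I)^{-1}$ is the standard multivariate ridge regression solution. The names `Z`, `H`, `W`, `gamma`, the one-hot label encoding, and the argmax decision rule are assumptions made for illustration, and the toy codes below are synthetic.

```python
import numpy as np

def train_ridge_classifier(Z, H, gamma=1e-2):
    """Return W minimizing ||H - W Z||_F^2 + gamma * ||W||_F^2.

    Z : (K, N) representations of the N training samples
    H : (C, N) one-hot class label matrix
    """
    K = Z.shape[0]
    A = Z @ Z.T + gamma * np.eye(K)          # symmetric positive definite
    # W = H Z^T A^{-1}; solve A W^T = Z H^T instead of forming the inverse.
    return np.linalg.solve(A, Z @ H.T).T

def predict(W, Z_test):
    """Assign each column of Z_test to the class with the largest score."""
    return np.argmax(W @ Z_test, axis=0)

# Synthetic example: near block-diagonal codes for 2 classes, 3 samples each.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
Z = np.zeros((4, 6))
Z[:2, :3] = 1.0
Z[2:, 3:] = 1.0
Z += 0.05 * rng.standard_normal(Z.shape)
H = np.eye(2)[labels].T                       # (2, 6) one-hot labels

W = train_ridge_classifier(Z, H, gamma=1e-2)
pred = predict(W, Z)
print(pred)
```

Solving the linear system rather than inverting the matrix is a numerical-stability choice and does not change the solution.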
3 OPTIMIZATION OF MM-SLDL
In each iteration, we update the variables of one modality while fixing the variables of the other modality, and within a modality the variables are updated alternately. To solve optimization problem (2), we first introduce an auxiliary variable to make it separable:
[TABLE]
The augmented Lagrangian function of (5) is defined as:
[TABLE]
[TABLE]
This function involves Lagrange multipliers and a balance parameter. The optimization problem (3) can be divided into two sub-problems as follows:
- Updating Coding Coefficient: With the dictionary fixed, we use the linearized alternating direction method with adaptive penalty (LADMAP) [14] to solve for the representation and the sparse noise. The augmented Lagrangian function (3) would reduce to:
[TABLE]
This function is minimized by alternately updating the variables as follows:
[TABLE]
where , and we notice that .
[TABLE]
[TABLE]
- Updating Dictionary: When the other variables are fixed, we can update the dictionary. The Lagrangian function (3) is further reduced to:
[TABLE]
With the remaining terms fixed, this equation is in quadratic form and can be solved directly as follows:
[TABLE]
We initialize the dictionary by applying the K-SVD method to the training samples of each class and then combining the resulting sub-dictionaries of all classes.
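The coefficient update above is solved with LADMAP [14]. Solvers of this family typically reduce nuclear-norm and $\ell_1$-norm subproblems to two proximal operators: singular value thresholding and element-wise soft thresholding. The sketch below shows those two operators in NumPy. It is not the paper's Algorithm 1, and the function names and the threshold argument `tau` are illustrative assumptions.

```python
import numpy as np

def soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_1: shrink every entry toward zero."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def singular_value_threshold(A, tau):
    """Proximal operator of tau * ||A||_*: shrink the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Scale the columns of U by the shrunken singular values, then recombine.
    return (U * s_shrunk) @ Vt

# Small checks on hand-computable inputs.
v = soft_threshold(np.array([-2.0, 0.5, 3.0]), 1.0)
M = singular_value_threshold(np.diag([3.0, 1.0]), 2.0)
print(v)
print(M)
```

Soft thresholding promotes sparsity in the noise and coefficient terms, while singular value thresholding lowers the rank of the representation by zeroing small singular values.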
4 RESULTS AND DISCUSSION
The performance of the MM-SLDL method is evaluated on three face datasets. We compare our method with three types of methods: (1) the multi-modal LR-DL method MLDL [9]; (2) multi-modal DL methods, including UMD2L [8] and MSDL [15]; and (3) single-modality LR-DL methods such as JP-LRDL [3], D2L2R2 [4] and SLRDL [5]. To construct the training set, we select images randomly; the selection is repeated multiple times, and we report the average recognition rates for all methods. We set the number of dictionary atoms of each class to the training size, and choose the tuning parameters of all methods by 5-fold cross validation.
Convolutional Neural Networks (CNNs) have significantly improved face recognition rates, and the most important ingredient for the success of such methods is the availability of large quantities of training data; however, transfer learning is a powerful tool for training on small target datasets. The authors of [18] showed that when the target dataset is small and similar to the original dataset, it is better to treat the CNN as a fixed feature extractor and train a linear classifier on the CNN features. We compare MM-SLDL with two deep methods: (1) deep features generated by the VGG-Face descriptor [17], which is based on a deep CNN trained on millions of images, followed by a nearest neighbor classifier; and (2) AlexNet [16], trained on the ImageNet dataset and fine-tuned on the target data.
AR Dataset [19] contains face images of many individuals, with the images of each person captured in two sessions. In each session, some images are obscured by scarves, some by sunglasses, and the remaining faces show different facial expressions or illumination variations; we refer to the latter as unobscured images. Following [5], experiments are conducted under three scenarios. Sunglasses: We select unobscured images and images with sunglasses from the first session as training samples for each person, and the rest of the unobscured and sunglasses images are used for testing. Scarf: We choose unobscured images and images with scarves from the first session for training, and test on the rest of the unobscured and scarf images. Mixed: We select unobscured images plus occluded images (with sunglasses and with scarves) from the first session for training, and use the remaining images in the two sessions for testing. We also design a challenging scenario, Misc., in which we select unobscured and scarf images from the first session for training and use the remaining unobscured and sunglasses images for testing. Here, the type of noise differs between the training and test sets. According to Table 1, MM-SLDL achieves the best performance in all scenarios, and the improvement is largest in the “Misc.” scenario, where every other method falls below 80%.
Fig. 1b illustrates examples of image decomposition on the AR dataset. The first and second rows show training images and the learned LR components in the two modalities. While the first modality keeps more details, the illumination invariant modality better separates occlusions from the original images; hence, a robust representation is learned by their fusion. Fig. 1a shows a testing sample and its components in the two modalities; their difference determines the winner, which is marked by a red tick.
Extended YaleB Dataset [20] contains face images of human subjects captured under different illumination conditions. We randomly select a subset of the images of each subject for training. We simulate various levels of contiguous occlusion by replacing a randomly located square block of each training image with an unrelated image, as seen in Fig. 3a. To make the task more challenging, test images are not occluded. We visualize the representations in the two modalities for testing images of the first few classes under an occluded training scenario in Fig. 2a. Testing images automatically generate a block-diagonal structure, and the second modality learns a better representation here. Fig. 2b illustrates the recognition rates of all methods across different occlusion levels; MM-SLDL outperforms its counterparts, especially for severely occluded images.
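The contiguous-occlusion protocol described above (a randomly located square block replaced by an unrelated image) can be simulated as in the following sketch. The function name `occlude`, the interpretation of the occlusion level as a fraction of the image area, and the nearest-neighbor resizing of the unrelated image are assumptions made for illustration.

```python
import numpy as np

def occlude(image, patch_source, fraction, rng):
    """Replace a random square block covering about `fraction` of the image
    area with a resized copy of an unrelated image (hypothetical helper).

    image, patch_source : 2-D grayscale arrays
    fraction            : occluded share of the image area, in [0, 1]
    rng                 : numpy.random.Generator
    """
    h, w = image.shape
    side = min(int(round(np.sqrt(fraction * h * w))), h, w)
    out = image.copy()
    if side == 0:
        return out
    # Top-left corner chosen uniformly so the block stays inside the image.
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    # Nearest-neighbor resize of the unrelated image to side x side.
    rows = np.arange(side) * patch_source.shape[0] // side
    cols = np.arange(side) * patch_source.shape[1] // side
    out[top:top + side, left:left + side] = patch_source[np.ix_(rows, cols)]
    return out

rng = np.random.default_rng(0)
face = np.zeros((32, 32))          # stand-in for a training face image
unrelated = np.ones((10, 10))      # stand-in for the unrelated image
occluded = occlude(face, unrelated, fraction=0.25, rng=rng)
print(int(occluded.sum()))         # number of replaced pixels
```

Only the training images would be passed through such a function, since the test images in this experiment are left unoccluded.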
LFW Dataset [21] contains unconstrained face images of different individuals, collected from the web with large variations in pose, expression, illumination, clothing, hairstyles, occlusion, etc. We use an aligned version of LFW called LFWa [22], and restrict the experiment to subjects that have at least a minimum number of samples each. Some of these images are shown in Fig. 3b. A central region is cropped from each image, and the first samples of each class are selected for training, while the rest are used for testing. Table 2 shows the recognition rates of all compared methods. Although VGG-Face has already been trained on millions of extra images, MM-SLDL achieves competitive results using much less training data with large intra-class variation. Also, as expected, the fine-tuned AlexNet is prone to overfitting because the target data are small and very different in content from ImageNet. Finally, to verify the role of the illumination invariant modality, we use only a single modality in the objective function (2) and change the ideal representation term accordingly. The results are reported under SLDL-Mod2 and, as observed, are not competitive.
5 CONCLUSIONS
We proposed a face recognition method that learns discriminative dictionaries and structured sparse LR representations from contaminated face images in two modalities. Adopting the illumination invariant representation of images as a modality further strengthens the model. Experimental results indicate that MM-SLDL is robust, achieving state-of-the-art performance in the presence of occlusion, illumination and pose changes, using only a few training samples.
REFERENCES
- [1] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright, “Robust principal component analysis?,” Journal of the ACM (JACM), vol. 58, no. 3, pp. 11, 2011.
- [2] Moein Shakeri and Hong Zhang, “Corola: a sequential solution to moving object detection using low-rank approximation,” Computer Vision and Image Understanding, vol. 146, pp. 27–39, 2016.
- [3] Homa Foroughi, Nilanjan Ray, and Hong Zhang, “Object classification with joint projection and low-rank dictionary learning,” arXiv preprint arXiv:1612.01594, 2016.
- [4] Liangyue Li, Sheng Li, and Yun Fu, “Discriminative dictionary learning with low-rank regularization for face recognition,” in Automatic Face and Gesture Recognition, 2013 10th IEEE International Conference and Workshops on. IEEE, 2013, pp. 1–6.
- [5] Yangmuzi Zhang, Zhuolin Jiang, and Larry Davis, “Learning structured low-rank representations for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 676–683.
- [6] David L Hall and James Llinas, “An introduction to multisensor data fusion,” Proceedings of the IEEE, vol. 85, no. 1, pp. 6–23, 1997.
- [7] Gianluca Monaci, Philippe Jost, Pierre Vandergheynst, Boris Mailhe, Sylvain Lesage, and Rémi Gribonval, “Learning multi-modal dictionaries: Application to audiovisual data,” in International Workshop on Multimedia Content Representation, Classification and Security. Springer, 2006, pp. 538–545.
- [8] Xiao-Yuan Jing, Ruimin Hu, Fei Wu, Xi-Lin Chen, Qian Liu, and Yong-Fang Yao, “Uncorrelated multi-view discrimination dictionary learning for recognition,” in AAAI, 2014, pp. 2787–2795.
