Deep-Emotion: Facial Expression Recognition Using Attentional Convolutional Network

Shervin Minaee; Amirali Abdolrashidi

arXiv:1902.01019·cs.CV·February 5, 2019

TL;DR

This paper introduces Deep-Emotion, an attentional convolutional network for facial expression recognition that outperforms previous models by focusing on key facial regions and utilizing visualization techniques to interpret emotion-specific features.

Contribution

The work presents a novel attentional convolutional network that enhances facial expression recognition accuracy and provides insights into emotion-related facial regions.

Findings

01 Significant improvement over previous models on multiple datasets

02 Different emotions are associated with distinct facial regions

03 Visualization techniques reveal important face areas for emotion detection

Tables

Table I: Classification accuracies on the FER-2013 dataset

Method	Accuracy Rate
Bag of Words [52]	67.4%
VGG+SVM [53]	66.31%
GoogleNet [54]	65.2%
Mollahosseini et al [19]	66.4%
The proposed algorithm	70.02%

Table II: Classification accuracy on the FERG dataset

Method	Accuracy Rate
DeepExpr [2]	89.02%
Ensemble Multi-feature [49]	97%
Adversarial NN [48]	98.2%
The proposed algorithm	99.3%

Table III: Classification accuracy on the JAFFE dataset

Method	Accuracy Rate
Fisherface [47]	89.2%
Salient Facial Patch [46]	91.8%
CNN+SVM [50]	95.31%
The proposed algorithm	92.8%

Table IV: Classification accuracy on the CK+ dataset

Method	Accuracy Rate
MSR [39]	91.4%
3DCNN-DAP [40]	92.4%
Inception [19]	93.2%
IB-CNN [41]	95.1%
IACNN [42]	95.37%
DTAGN [43]	97.2%
ST-RNN [44]	97.2%
PPDN [45]	97.3%
The proposed algorithm	98.0%

Equations

$$L_{overall} = L_{classifier} + \lambda \lVert w_{(fc)} \rVert_{2}^{2}$$

Full text

Shervin Minaee (1), Amirali Abdolrashidi (2)

(1) Expedia Group

(2) University of California, Riverside

Abstract

Facial expression recognition has been an active research area over the past few decades, and it is still challenging due to the high intra-class variation. Traditional approaches for this problem rely on hand-crafted features such as SIFT, HOG and LBP, followed by a classifier trained on a database of images or videos. Most of these works perform reasonably well on datasets of images captured under controlled conditions, but fail to perform as well on more challenging datasets with more image variation and partial faces. In recent years, several works have proposed end-to-end frameworks for facial expression recognition using deep learning models. Despite the better performance of these works, there still seems to be considerable room for improvement. In this work, we propose a deep learning approach based on an attentional convolutional network, which is able to focus on important parts of the face, and achieves significant improvement over previous models on multiple datasets, including FER-2013, CK+, FERG, and JAFFE. We also use a visualization technique which is able to find important face regions for detecting different emotions, based on the classifier’s output. Through experimental results, we show that different emotions seem to be sensitive to different parts of the face.

I Introduction

Emotions are an inevitable part of any inter-personal communication. They can be expressed in many different forms, which may or may not be observable with the naked eye. Therefore, with the right tools, any indications preceding or following them can be subject to detection and recognition. The need to detect a person’s emotions has grown in the past few years, and human emotion recognition has attracted interest in various fields including, but not limited to, human-computer interfaces [1], animation [2], medicine [3, 4] and security [5, 6].

Emotion recognition can be performed using different features, such as face [2, 19, 20], speech [23, 5], EEG [24], and even text [25]. Among these features, facial expressions are one of the most popular, if not the most popular, for a number of reasons: they are visible, they contain many useful features for emotion recognition, and it is easier to collect a large dataset of faces than of the other modalities [2, 38, 37].

Recently, with the use of deep learning and especially convolutional neural networks (CNNs) [32], many features can be extracted and learned for a decent facial expression recognition system [7, 18]. It is, however, noteworthy that in the case of facial expressions, many of the clues come from a few parts of the face, e.g. the mouth and eyes, whereas other parts, such as ears and hair, play little part in the output [33]. This means that ideally, the machine learning framework should focus only on the important parts of the face, and be less sensitive to other face regions.

In this work we propose a deep learning based framework for facial expression recognition, which takes the above observation into account and uses an attention mechanism to focus on the salient parts of the face. We show that by using an attentional convolutional network, even a network with few layers (fewer than 10 layers) is able to achieve a very high accuracy rate. More specifically, this paper presents the following contributions:

• We propose an approach based on an attentional convolutional network, which can focus on feature-rich parts of the face and still outperform notable recent works in accuracy.

• In addition, we use the visualization technique proposed in [34] to highlight the face image’s most salient regions, i.e. the parts of the image which have the strongest impact on the classifier’s outcome. Samples of salient regions for different emotions are shown in Figure 1.

In the following sections, we first provide an overview of related works in Section II. The proposed framework and model architecture are explained in Section III. We will then provide the experimental results, overview of databases used in this work, and also model visualization in Section IV. Finally we conclude the paper in Section V.

II Related Works

In one of the most iconic works in emotion recognition, Paul Ekman [35] identified happiness, sadness, anger, surprise, fear and disgust as the six principal emotions (besides neutral). Ekman later developed FACS [36] using this concept, thus setting the standard for works on emotion recognition ever since. Neutral was also included later on in most emotion recognition datasets, resulting in seven basic emotions. Image samples of these emotions from three datasets are displayed in Figure 2.

Earlier works on emotion recognition rely on the traditional two-step machine learning approach, where in the first step some features are extracted from the images, and in the second step a classifier (such as an SVM, neural network, or random forest) is used to detect the emotions. Some of the popular hand-crafted features used for facial expression recognition include the histogram of oriented gradients (HOG) [26, 28], local binary patterns (LBP) [27], Gabor wavelets [31] and Haar features [30]. A classifier would then assign the best emotion to the image. These approaches seemed to work fine on simpler datasets, but with the advent of more challenging datasets (which have more intra-class variation), they started to show their limitations. To get a better sense of some of the possible challenges with the images, we refer the readers to the images in the first row of Figure 2, where the image can contain a partial face, or the face can be occluded by a hand or eyeglasses.

With the great success of deep learning, and more specifically of convolutional neural networks for image classification and other vision problems [9, 10, 11, 12, 13, 14, 16, 15], several groups developed deep learning-based models for facial expression recognition (FER). To name some of the promising works, Khorrami et al. [7] showed that CNNs can achieve high accuracy in emotion recognition, using a zero-bias CNN on the extended Cohn-Kanade dataset (CK+) and the Toronto Face Dataset (TFD) to achieve state-of-the-art results. Aneja et al. [2] developed a model of facial expressions for stylized animated characters based on deep learning, by training one network to model the expressions of human faces, one to model those of animated faces, and one to map human images onto animated ones. Mollahosseini et al. [19] proposed a neural network for FER using two convolution layers, one max-pooling layer, and four “inception” layers, i.e. sub-networks. Liu et al. [20] combined feature extraction and classification in a single looped network, citing the two parts’ need for feedback from each other. They used their Boosted Deep Belief Network (BDBN) on CK+ and JAFFE, achieving state-of-the-art accuracy. Barsoum et al. [21] applied a deep CNN (DCNN) to noisy labels acquired via crowd-sourcing for ground-truth images. They used 10 taggers to re-label each image in the dataset and tried various cost functions for their DCNN, achieving decent accuracy. Han et al. [41] proposed an incremental boosting CNN (IB-CNN) to improve the recognition of spontaneous facial expressions by boosting the discriminative neurons, improving over the best methods of the time. Meng et al. [42] proposed an identity-aware CNN (IA-CNN), which used identity- and expression-sensitive contrastive losses to reduce the variations in learning identity- and expression-related information.

All of the above works achieve significant improvements over the traditional works on emotion recognition, but they seem to lack a simple mechanism for attending to the face regions that matter for emotion detection. In this work, we try to address this problem by proposing a framework based on an attentional convolutional network, which is able to focus on salient face regions.

III The Proposed Framework

We propose an end-to-end deep learning framework, based on an attentional convolutional network, to classify the underlying emotion in face images. Oftentimes, improving a deep neural network relies on adding more layers/neurons, facilitating gradient flow in the network (e.g. by adding skip layers), or better regularization (e.g. spectral normalization), especially for classification problems with a large number of classes. However, for facial expression recognition, due to the small number of classes, we show that a convolutional network with fewer than 10 layers and attention (trained from scratch) is able to achieve promising results, beating state-of-the-art models on several databases.

Given a face image, it is clear that not all parts of the face are important in detecting a specific emotion, and in many cases we only need to attend to specific regions to get a sense of the underlying emotion. Based on this observation, we add an attention mechanism to our framework, through a spatial transformer network, to focus on important face regions.

Figure 3 illustrates the proposed model architecture. The feature extraction part consists of four convolutional layers, with every two convolutional layers followed by a max-pooling layer and a rectified linear unit (ReLU) activation function. They are then followed by a dropout layer and two fully-connected layers. The spatial transformer (the localization network) consists of two convolution layers (each followed by max-pooling and ReLU) and two fully-connected layers. After the transformation parameters are regressed, the input is resampled on the sampling grid $T(\theta)$, producing the warped data. The spatial transformer module essentially tries to focus on the most relevant part of the image, by estimating a sample over the attended region. Different transformations can be used to warp the input to the output; here we used an affine transformation, which is commonly used in many applications. For further details about the spatial transformer network, please refer to [17].
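To make the affine warping step concrete, the following is a minimal NumPy sketch of a sampling grid and bilinear resampling. It is an illustration under stated assumptions, not the authors' implementation: the function names `affine_grid` and `bilinear_sample`, the normalized coordinate range of [-1, 1], and the zero padding outside the image are all choices made here. In the model, the 2x3 matrix `theta` would be regressed by the localization network; in the sketch it is supplied by hand.

```python
import numpy as np

def affine_grid(theta, height, width):
    """Map every output pixel to source coordinates in [-1, 1] via a 2x3 affine matrix."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    target = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3) homogeneous coords
    return target @ theta.T                                  # (H, W, 2) source (x, y)

def bilinear_sample(image, grid):
    """Sample a 2D image at normalized source coordinates; pixels outside count as zero."""
    h, w = image.shape
    x = (grid[..., 0] + 1.0) * (w - 1) / 2.0   # back to pixel coordinates
    y = (grid[..., 1] + 1.0) * (h - 1) / 2.0
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                    # interpolation weights

    def pixel(yy, xx):
        valid = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        values = image[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)]
        return np.where(valid, values, 0.0)

    top = (1 - wx) * pixel(y0, x0) + wx * pixel(y0, x1)
    bottom = (1 - wx) * pixel(y1, x0) + wx * pixel(y1, x1)
    return (1 - wy) * top + wy * bottom

# Hand-picked theta that zooms in on the central half of the image (illustrative only).
image = np.arange(25, dtype=float).reshape(5, 5)
zoom = np.array([[0.5, 0.0, 0.0],
                 [0.0, 0.5, 0.0]])
warped = bilinear_sample(image, affine_grid(zoom, 5, 5))
```

With the identity matrix the output reproduces the input; scaling factors below 1 crop and magnify a sub-region, which is how an affine transformer can attend to part of the face.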

This model is then trained by optimizing a loss function with a stochastic gradient descent approach (more specifically, the Adam optimizer). The loss function in this work is simply the sum of two terms: the classification loss (cross-entropy) and a regularization term (the $\ell_{2}$ norm of the weights in the last two fully-connected layers).

$$L_{overall} = L_{classifier} + \lambda \lVert w_{(fc)} \rVert_{2}^{2}$$

The regularization weight, $\lambda$ , is tuned on the validation set. Adding both dropout and $\ell_{2}$ regularization enables us to train our models from scratch even on very small datasets, such as JAFFE and CK+. It is worth mentioning that we train a separate model for each one of the databases used in this work. We also tried a network architecture with more than 50 layers, but the accuracy did not improve much. Therefore the simpler model shown here was used in the end.
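The two-term loss described above can be sketched in a few lines of NumPy. This is a hedged illustration: the function names are hypothetical, the example value of `lam` is arbitrary (the paper tunes $\lambda$ on the validation set and does not report it here), and `fc_weights` stands in for the weight matrices of the last two fully-connected layers.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def overall_loss(logits, labels, fc_weights, lam):
    """Classification loss plus lam times the squared L2 norm of the given weights."""
    l2_penalty = sum(np.sum(w ** 2) for w in fc_weights)
    return cross_entropy(logits, labels) + lam * l2_penalty

# Toy usage: 4 samples, 7 emotion classes, two small stand-in weight matrices.
logits = np.zeros((4, 7))
labels = np.array([0, 3, 6, 2])
fc_weights = [np.ones((2, 2)), np.ones((2, 1))]
loss = overall_loss(logits, labels, fc_weights, lam=0.1)  # lam value is an assumption
```

Because only the fully-connected weights enter the penalty, the convolutional filters are regularized here by dropout alone.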

IV Experimental Results

In this section we provide a detailed experimental analysis of our model on several facial expression recognition databases. We first provide a brief overview of the databases used in this work. We then report the performance of our models on four databases and compare the results with some of the promising recent works. Finally, we show the salient regions detected by our trained model using a visualization technique.

IV-A Databases

In this work, we provide the experimental analysis of the proposed model on several popular facial expression recognition datasets, including FER2013 [37], the extended Cohn-Kanade [22], Japanese Female Facial Expression (JAFFE) [38], and Facial Expression Research Group Database (FERG) [2]. Before diving into the results, we are going to give a brief overview of these databases.

FER2013: The Facial Expression Recognition 2013 (FER2013) database was first introduced in the ICML 2013 Challenges in Representation Learning [37]. This dataset contains 35,887 images of 48x48 resolution, most of which are taken in wild settings. Originally the training set contained 28,709 images, and the validation and test sets each include 3,589 images. This database was created using the Google image search API, and faces are automatically registered. Faces are labeled as any of the six cardinal expressions as well as neutral. Compared to the other datasets, FER has more variation in the images, including face occlusion (mostly with a hand), partial faces, low-contrast images, and eyeglasses. Four sample images from the FER dataset are shown in Figure 4.

CK+: The extended Cohn-Kanade (known as CK+) facial expression database [22] is a public dataset for action unit and emotion recognition. It includes both posed and non-posed (spontaneous) expressions. CK+ comprises a total of 593 sequences across 123 subjects. In most previous works, the last frame of each sequence is taken and used for image-based facial expression recognition. Six sample images from this dataset are shown in Figure 5.

JAFFE: This dataset contains 213 images of the 7 facial expressions posed by 10 Japanese female models. Each image has been rated on 6 emotion adjectives by 60 Japanese subjects [38]. Four sample images from this dataset are shown in Figure 6.

FERG: FERG is a database of stylized characters with annotated facial expressions. The database contains 55,767 annotated face images of six stylized characters. The characters were modeled using MAYA. The images for each character are grouped into seven types of expressions [2]. Six sample images from this database are shown in Figure 7. We mainly wanted to try our algorithm on this database to see how it performs on cartoonish characters.

IV-B Experimental Analysis and Comparison

We will now present the performance of the proposed model on the above datasets. In each case, we train the model on the training subset of the dataset, validate on the validation set, and report the accuracy on the test set.

Before getting into the details of the model’s performance on different datasets, we briefly discuss our training procedure. We trained one model per dataset in our experiments, but we tried to keep the architecture and hyper-parameters similar among these different models. Each model is trained for 500 epochs from scratch, on an AWS EC2 instance with an NVIDIA Tesla K80 GPU. We initialize the network weights with random Gaussian variables with zero mean and a standard deviation of 0.05. For optimization, we used the Adam optimizer with a learning rate of 0.005 and weight decay (different optimizers were tried, including stochastic gradient descent, and Adam seemed to perform slightly better). It takes around 2-4 hours to train our models on the FER and FERG datasets. For JAFFE and CK+, since there are far fewer images, it takes less than 10 minutes to train a model. Data augmentation is applied to the images in the training sets to train the model on a larger number of images and to make the trained model more invariant to small transformations.
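As a rough illustration of the initialization and optimizer settings mentioned above, the sketch below draws weights from a zero-mean Gaussian with standard deviation 0.05 and applies one Adam update with a learning rate of 0.005. The values of `beta1`, `beta2` and `eps` are the commonly used defaults and are assumptions here, since the text does not state them; weight decay is omitted because its value is not given. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(shape, std=0.05):
    """Zero-mean Gaussian initialization with the standard deviation quoted in the text."""
    return rng.normal(loc=0.0, scale=std, size=shape)

def adam_step(w, grad, state, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; beta1, beta2 and eps are assumed defaults, not from the paper."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # first-moment estimate
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy usage with a made-up gradient.
weights = init_weights((64, 50))
state = {"t": 0, "m": np.zeros_like(weights), "v": np.zeros_like(weights)}
grad = np.ones_like(weights)
weights_after = adam_step(weights, grad, state)
```

On the very first step the bias-corrected update has magnitude close to the learning rate for every weight, regardless of the gradient's scale.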

As discussed before, the FER-2013 dataset is more challenging than the other facial expression recognition datasets we used. Besides the intra-class variation of FER, another main challenge in this dataset is the imbalanced nature of the different emotion classes. Some of the classes, such as happiness and neutral, have many more examples than others. We used all 28,709 images in the training set to train the model, validated on the 3,589 validation images, and report the model accuracy on the 3,589 images in the test set. We were able to achieve an accuracy rate of around 70.02% on the test set. The confusion matrix on the test set of the FER dataset is shown in Figure 8. As we can see, the model makes more mistakes on classes with fewer samples, such as disgust and fear.
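For readers who want to reproduce this kind of per-class analysis, a confusion matrix and per-class recall can be computed as in the sketch below. The labels in the usage example are made up for illustration and are not results from the paper; the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (0 for empty classes)."""
    totals = matrix.sum(axis=1)
    return np.divide(np.diag(matrix), totals,
                     out=np.zeros(len(totals)), where=totals > 0)

# Made-up labels for three classes, purely illustrative.
true_labels = [0, 0, 1, 1, 1, 2]
predicted_labels = [0, 1, 1, 1, 0, 2]
cm = confusion_matrix(true_labels, predicted_labels, num_classes=3)
recall = per_class_recall(cm)
accuracy = np.trace(cm) / cm.sum()
```

On an imbalanced dataset the overall accuracy is dominated by the frequent classes, so the per-class recall is the quantity that exposes weak classes.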

The comparison of the results of our model with some of the previous works on FER 2013 is provided in Table I.

For FERG dataset, we use around 34k images for training, 14k for validation, and 7k for testing. For each facial expression, we randomly select 1k images for testing. We were able to achieve an accuracy rate of around 99.3%. The confusion matrix on the test set of FERG dataset is shown in Figure 9.

The comparison between the proposed algorithm and some of the previous works on the FERG dataset is provided in Table II.

For JAFFE dataset, we use 120 images for training, 23 images for validation, and 70 images for test (10 images per emotion in the test set). The confusion matrix of the predicted results on this dataset is shown in Figure 10. The overall accuracy on this dataset is around 92.8%.
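A split with a fixed number of test images per emotion, as described for JAFFE (10 per class, giving 70 test images out of 213, with the remaining 143 divided into 120 for training and 23 for validation), could be produced along the following lines. This is only a sketch: the labels are placeholders rather than the real JAFFE annotations, the seed is arbitrary, and the paper does not say how its split was drawn.

```python
import random
from collections import defaultdict

def split_per_class(labels, test_per_class, num_train, seed=0):
    """Hold out a fixed number of samples per class for testing, then split the rest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    test, rest = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        test.extend(indices[:test_per_class])
        rest.extend(indices[test_per_class:])

    rng.shuffle(rest)
    return rest[:num_train], rest[num_train:], test

# Placeholder labels for 213 images and 7 classes (not the real JAFFE annotations).
labels = [i % 7 for i in range(213)]
train, val, test = split_per_class(labels, test_per_class=10, num_train=120)
```

Fixing the per-class count keeps the test set balanced even when the classes in the full dataset are not exactly the same size.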

The comparison with previous works on the JAFFE dataset is shown in Table III.

For CK+, 70% of the images are used for training, 10% for validation, and 20% for testing. The comparison of our model with previous works on the extended CK dataset is shown in Table IV.

IV-C Model Visualization

Here we provide a simple approach to visualize the regions that are important for classifying different facial expressions, inspired by the work in [34]. We start from the top-left corner of an image, and each time zero out a square region of size $N \times N$ inside the image and make a prediction using the trained model on the occluded image. If occluding that region causes the model to predict the wrong facial expression label, that region is considered a potential region of importance for classifying the specific expression. On the other hand, if removing that region does not change the model’s prediction, we infer that the region is not very important for detecting the corresponding facial expression. If we repeat this procedure for sliding windows of size $N \times N$, each time shifting them with a stride of $s$, we obtain a saliency map of the most important regions for detecting an emotion in different images.
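The sliding-window occlusion procedure can be sketched as follows. This is a minimal illustration, not the authors' code: `predict` is a hypothetical callable that maps an image to a class label (in practice it would wrap the trained network), `window` and `stride` play the roles of $N$ and $s$, and the toy classifier below exists only to make the example runnable.

```python
import numpy as np

def occlusion_saliency(image, true_label, predict, window, stride):
    """Count, per pixel, how many occluding windows covering it flip the prediction."""
    h, w = image.shape
    saliency = np.zeros((h, w), dtype=float)
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            occluded = image.copy()
            occluded[top:top + window, left:left + window] = 0.0  # zero out the region
            if predict(occluded) != true_label:
                saliency[top:top + window, left:left + window] += 1.0
    return saliency

def toy_predict(image):
    """Stand-in classifier: label 1 only if a fixed 'mouth' patch is fully bright."""
    return 1 if image[5:7, 2:6].mean() > 0.9 else 0

# An all-bright 8x8 image that the toy classifier labels as class 1.
image = np.ones((8, 8))
saliency = occlusion_saliency(image, true_label=1, predict=toy_predict,
                              window=2, stride=2)
```

In this toy setup only the windows that overlap the patch at rows 5-6, columns 2-5 change the prediction, so the saliency map is nonzero only around that patch. A stride smaller than the window size gives overlapping windows and a smoother map, at the cost of more forward passes.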

We show nine example occluded images for a happy and an angry image from the JAFFE dataset, and how zeroing out different regions impacts the model prediction. As we can see, for the happy face, zeroing out the areas around the mouth causes the model to make a wrong prediction, whereas for the angry face, zeroing out the areas around the eyes and eyebrows causes the model to make a mistake.

Figure 12 shows the important regions of 7 sample images from the JAFFE dataset, each corresponding to a different emotion. There are some interesting observations from these results. For example, for the sample image with neutral emotion in the fourth row, the saliency region essentially covers the entire face, which means that all these regions are important for inferring that a given image has a neutral facial expression. This makes sense, since changes in any part of the face (such as the eyes, lips, eyebrows and forehead) could lead to a different facial expression, and the algorithm needs to analyze all those parts in order to correctly classify a neutral image. This is, however, not the case for most of the other emotions, such as happiness and fear, where the areas around the mouth turn out to be more important than other regions.

It is worth mentioning that different images with the same facial expression could have different saliency maps, due to the different gestures and variations in the image. In Figure 13, we show the important regions for three images with the facial expression of “fear”. As can be seen from this figure, the important regions for these images are very similar in that they all cover the mouth, but the last one also includes part of the forehead. This could be because of the strong presence of forehead lines, which are not visible in the other two images.

V Conclusion

This paper proposes a new framework for facial expression recognition using an attentional convolutional network. We believe attention is an important component for detecting facial expressions, which can enable neural networks with fewer than 10 layers to compete with (and even outperform) much deeper networks for emotion recognition. We also provided an extensive experimental analysis of our work on four popular facial expression recognition databases, and showed promising results. In addition, we have deployed a visualization method to highlight the salient regions of face images, which are the most crucial for detecting different facial expressions.

Acknowledgment

We would like to express our gratitude to the people at University of Washington Graphics and Imaging Lab (GRAIL) for providing us with access to the FERG database. We would also like to thank our colleagues and partners for reviewing our work, and providing very useful comments and suggestions.

Bibliography

[1] Cowie, Roddy, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George Votsis, Stefanos Kollias, Winfried Fellenz, and John G. Taylor. “Emotion recognition in human-computer interaction.” IEEE Signal Processing Magazine 18, no. 1: 32-80, 2001.
[2] Aneja, Deepali, Alex Colburn, Gary Faigin, Linda Shapiro, and Barbara Mones. “Modeling stylized character expressions via deep learning.” In Asian Conference on Computer Vision, pp. 136-153. Springer, Cham, 2016.
[3] Edwards, Jane, Henry J. Jackson, and Philippa E. Pattison. “Emotion recognition via facial expression and affective prosody in schizophrenia: a methodological review.” Clinical Psychology Review 22, no. 6: 789-832, 2002.
[4] Chu, Hui-Chuan, William Wei-Jen Tsai, Min-Ju Liao, and Yuh-Min Chen. “Facial emotion recognition with transition detection for students with high-functioning autism in adaptive e-learning.” Soft Computing: 1-27, 2017.
[5] Clavel, Chloé, Ioana Vasilescu, Laurence Devillers, Gaël Richard, and Thibaut Ehrette. “Fear-type emotion recognition for future audio-based surveillance systems.” Speech Communication 50, no. 6: 487-503, 2008.
[6] Saste, Sonali T., and S. M. Jagdale. “Emotion recognition from speech using MFCC and DWT for security system.” In Electronics, Communication and Aerospace Technology (ICECA), 2017 International conference of, vol. 1, pp. 701-704. IEEE, 2017.
[7] Khorrami, Pooya, Thomas Paine, and Thomas Huang. “Do deep neural networks learn facial action units when doing expression recognition?” Proceedings of the IEEE International Conference on Computer Vision Workshops. 2015.
[8] Kahou, Samira Ebrahimi, Xavier Bouthillier, Pascal Lamblin et al. “Emonets: Multimodal deep learning approaches for emotion recognition in video.” Journal on Multimodal User Interfaces 10, no. 2: 99-111, 2016.