Bounded Residual Gradient Networks (BReG-Net) for Facial Affect Computing
Behzad Hasani, Pooran Singh Negi, Mohammad H. Mahoor

TL;DR
This paper introduces BReG-Net, a novel neural network architecture with bounded residual gradients for improved facial expression recognition, addressing gradient issues and class imbalance to outperform existing methods on multiple datasets.
Contribution
The paper proposes BReG-Net with bounded residual gradients and a weighted loss function, enhancing generalization and performance in facial affect recognition tasks.
Findings
BReG-Net outperforms state-of-the-art methods on three facial databases.
Bounded residual gradients enable shallower networks with better accuracy.
Weighted loss improves recognition of underrepresented categories.
| Database | | | | | | |
|---|---|---|---|---|---|---|
| AffectNet | 57.37 | 58.83 | 59.43 | 64.02 | 60.03 | 63.54 |
| FER2013 | 65.80 | 66.21 | 65.16 | 67.66 | 68.74 | 69.49 |

| Database | F1-score (Orig*) | F1-score (Norm*) | kappa (Orig) | kappa (Norm) | alpha (Orig) | alpha (Norm) | MCC (Orig) | MCC (Norm) | PPV (Orig) | PPV (Norm) |
|---|---|---|---|---|---|---|---|---|---|---|
| AffectNet | 0.63 | 0.68 | 0.58 | 0.63 | 0.58 | 0.64 | 0.59 | 0.64 | 0.62 | 0.71 |
| FER2013 | 0.67 | 0.67 | 0.62 | 0.62 | 0.62 | 0.62 | 0.62 | 0.62 | 0.69 | 0.68 |

*Orig and Norm stand for Original and skew-Normalized, respectively.

| Database | F1-score (Orig*) | F1-score (Norm*) | kappa (Orig) | kappa (Norm) | alpha (Orig) | alpha (Norm) | MCC (Orig) | MCC (Norm) | PPV (Orig) | PPV (Norm) |
|---|---|---|---|---|---|---|---|---|---|---|
| AffectNet | 0.58 | 0.60 | 0.52 | 0.54 | 0.52 | 0.54 | 0.52 | 0.54 | 0.58 | 0.62 |
| FER2013 | 0.67 | 0.68 | 0.61 | 0.62 | 0.61 | 0.62 | 0.62 | 0.62 | 0.69 | 0.69 |

*Orig and Norm stand for Original and skew-Normalized, respectively.

| Database | CC (valence) | CC (arousal) | CCC (valence) | CCC (arousal) | SAGR (valence) | SAGR (arousal) |
|---|---|---|---|---|---|---|
| AffectNet | 0.66 | 0.84 | 0.66 | 0.82 | 0.73 | 0.84 |
| Affect-in-Wild | 0.45 | 0.40 | 0.43 | 0.34 | 0.63 | 0.77 |

| Database | CC (valence) | CC (arousal) | CCC (valence) | CCC (arousal) | SAGR (valence) | SAGR (arousal) |
|---|---|---|---|---|---|---|
| AffectNet | 0.66 | 0.84 | 0.63 | 0.82 | 0.66 | 0.84 |
| Affect-in-Wild | 0.41 | 0.41 | 0.38 | 0.35 | 0.61 | 0.75 |
Bounded Residual Gradient Networks (BReG-Net) for Facial Affect Computing
Behzad Hasani, Pooran Singh Negi, and Mohammad H. Mahoor
Department of Electrical & Computer Engineering, University of Denver, USA
Abstract
Residual-based neural networks have shown remarkable results in various visual recognition tasks including Facial Expression Recognition (FER). Despite the tremendous efforts that have been made to improve the performance of FER systems using DNNs, existing methods are not generalizable enough for practical applications. This paper introduces Bounded Residual Gradient Networks (BReG-Net) for facial expression recognition, in which the shortcut connection between the input and the output of the ResNet module is replaced with a differentiable function with a bounded gradient. This configuration prevents the network from facing the vanishing or exploding gradient problem. We show that utilizing such non-linear units results in shallower networks with better performance. Further, by using a weighted loss function which gives a higher priority to less represented categories, we can achieve an overall better recognition rate. The results of our experiments show that BReG-Nets outperform state-of-the-art methods on three publicly available facial databases in the wild, on both the categorical and dimensional models of affect.
978-1-7281-0089-0/19/$31.00 ©2019 IEEE
I INTRODUCTION
Facial expressions are one of the most important nonverbal channels for expressing internal emotions during face-to-face communication. Six expressions of anger, disgust, fear, happiness, sadness, and surprise are defined as the basic emotional expressions by Ekman et al. [5]. Automated Facial Expression Recognition (FER) has been a topic of study for decades. Although there have been many achievements in developing automated FER systems, the majority of existing methods lack the required generalization because they were developed using controlled data [25]. This is predominantly because there are significant variations in facial images owing to variable scene lighting, background variation, camera view, and subjects’ head pose, gender, and ethnicity [21]. A comprehensive way of studying facial expressions is to approach the task through the concept of affective computing. Affect is a psychological term for describing the external exhibition of internal emotions and feelings. Affective computing attempts to develop systems that can interpret and estimate human affects through different channels (e.g. visual, auditory, biological signals, etc.) [29].
The dimensional modeling of affect can distinguish between subtle differences in exhibiting affect and encode small changes in the intensity of each emotion on a continuous scale, such as valence and arousal where valence shows how positive or negative an emotion is, and arousal indicates how much an event is intriguing/agitating or calming/soothing [24]. This paper focuses on developing automated algorithms for computation of the categorical and dimensional models of affect.
In the field of machine learning, one of the main tasks is to optimize a function or distribution estimation with respect to a defined measure. Based on the connectionist principle [23], deep neural networks allow us to build very complex classes of functions. A wide variety of network topologies and activation functions have been proposed in recent years, and they seem to play a crucial role in designing and improving the underlying class of reproducible functions available to DNNs. To pave the way for training very deep DNNs, current methods focus on reducing neuron saturation or improving the efficiency of the gradient flow across the network’s layers. Such approaches are evident in the ReLU class of non-linear functions and in the use of identity mappings in Deep Residual Networks [11]. While deeper architectures have been shown to improve recognition results, another possibility is to design more complex neurons that extract more useful information at each layer of the network, which results in shallower networks with fewer parameters but more comprehensive information and a higher recognition rate.
This paper proposes and evaluates BReG-Net (Figure 1), in which the aforementioned identity mapping is replaced with a differentiable function with a bounded gradient that results in a shallower network with a considerably better recognition rate. We evaluate our proposed method using three in the wild facial expression databases (AffectNet [20], Affect-in-the-wild [32], and FER2013 [1]) in computation of both the categorical and dimensional models of affect.
II RELATED WORK
II-A Facial expression recognition
In recent years, “Convolutional Neural Networks” (CNNs) have become the most popular approach in the field of computer vision and pattern recognition. AlexNet and GoogLeNet are among the first successful architectures proposed based on CNNs. AlexNet consists of several convolution layers followed by max-pooling layers and Rectified Linear Units (ReLUs). Szegedy et al. [27] introduced GoogLeNet, which is composed of multiple “Inception” layers. Inception applies several convolutions on the feature map at different scales, which extends the model both in depth and width. Mollahosseini et al. [19, 21] have used the Inception layer for the task of facial expression recognition and achieved state-of-the-art results. Following the success of Inception layers, several variations of them have been proposed [13]. Moreover, the Inception layer has been combined with the residual unit introduced by He et al. [10], and the resulting architecture has been shown to accelerate the training of Inception networks significantly [26]. Hasani et al. proposed a modification of ResNets for the task of facial expression recognition [7] and valence/arousal prediction of emotions [6]. While these methods use very deep architectures, the question of whether a more complex building block results in a shallower and more efficient network remains unanswered. In the following, we will review some of the works that have looked into this concept.
II-B Dimensional model of affect
A few studies have been conducted on the dimensional model of affect in the literature. Nicolaou et al. [22] trained bidirectional Long Short Term Memory (LSTM) architecture on multiple engineered features extracted from audio, facial geometry, and shoulders. They achieved Root Mean Square Error (RMSE) of 0.15 and Correlation Coefficient (CC) of 0.79 for valence as well as RMSE of 0.21 and CC of 0.64 for arousal. He et al. [12] won the AVEC 2015 challenge by training multiple stacks of bidirectional LSTMs (DBLSTM-RNN) on engineered features extracted from audio (LLDs features), video (LPQ-TOP features), 52 ECG features, and 22 EDA features. They achieved RMSE of 0.104 and CC of 0.616 for valence as well as RMSE of 0.121 and CC of 0.753 for arousal. Koelstra et al. [15] trained Gaussian naive Bayes classifiers on EEG, physiological signals, and multimedia features by binary classification of low/high categories for arousal, valence, and liking on their proposed database DEAP. They achieved F1-score of 0.39, 0.37, and 0.40 on arousal, valence, and liking categories respectively.
III PROPOSED METHOD
In this paper, we propose a residual-based network in which the shortcut connection between the input and the output of the module is replaced with a differentiable function with bounded gradient. In the following, we explain each of the aforementioned concepts in detail.
III-A BReG-Net
The shortcut path in the ResNet module, which connects the input and output of the residual unit, accelerates the convergence of the loss and simultaneously prevents the vanishing/exploding gradient problem. The residual unit can be expressed as:
$$y_l = h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l), \qquad x_{l+1} = f(y_l) \tag{1}$$
where $x_l$ and $x_{l+1}$ are the input and the output of the $l$-th unit, $\mathcal{W}_l$ denotes the weights of that unit, and $\mathcal{F}$ is a residual function. In [9], $h(x_l) = x_l$ is a shortcut path, and $f$ is a ReLU function. Later on in [11], different combinations of components on both $\mathcal{F}$ and the shortcut were investigated. Hasani et al. [7] proposed a 3D ResNet-based model for the task of facial expression recognition in which the shortcut was replaced with element-wise multiplication of the weight function and the input layer as follows:
[TABLE]
in which $\circ$ denotes the Hadamard product and the weight values gradually decrease as pixels get farther away from the facial landmark points. This shows that a function more complex than a simple shortcut (identity mapping) can help the network extract more effective features in fewer layers, which results in a shallower network with fewer parameters to train.
In Equation (3), it can be seen that the identity bypass mapping is a simple choice and does not contribute to feature learning. In fact, the original motivation for using the identity mapping in the residual connection was to have bounded feedback from the loss layer to every other layer of the network. Building on this observation, we studied more complex residual connections with bounded gradients, which enrich feature learning through the residual parts of the network. This results in richer feature maps and therefore shallower networks. We investigated several functions and replaced the shortcut path in the network with each of them. There are a few limitations on choosing a suitable function: not every function can be used, because the network may otherwise fail to converge. The reason is that the gradient must be calculated in the training phase, and an improper choice of function leads to either a vanishing or an exploding gradient. To better understand this concept, we start with the ResNet residual unit formulation. In this case, since the shortcut function is an identity mapping of the inputs, Equation (1) and its derivative can be rewritten as follows:
[TABLE]
It is obvious that the identity mapping is differentiable and that its derivative is constant, which means that it is also bounded. This allows the ResNet to converge and prevents the vanishing/exploding gradient problem. Therefore, any function that replaces the identity mapping needs to have the same properties: it must be differentiable and have a bounded derivative.
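To make the boundedness requirement concrete, the sketch below numerically estimates the largest derivative magnitude of a few candidate shortcut functions on a fixed interval. The candidates (`math.atan`, `math.tanh`, and a cubic as a counter-example) are illustrative assumptions, not necessarily the functions evaluated in this paper, and `max_abs_slope` is a hypothetical helper introduced only for this example.

```python
import math

def max_abs_slope(h, lo=-50.0, hi=50.0, steps=10001, eps=1e-6):
    """Estimate max |h'(x)| on [lo, hi] with central finite differences."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        slope = (h(x + eps) - h(x - eps)) / (2 * eps)
        worst = max(worst, abs(slope))
    return worst

# Assumed candidates for the shortcut function (illustration only).
candidates = {
    "identity": lambda x: x,    # ResNet shortcut: derivative is 1 everywhere
    "atan": math.atan,          # derivative 1 / (1 + x^2), never above 1
    "tanh": math.tanh,          # derivative 1 - tanh(x)^2, never above 1
    "cubic": lambda x: x ** 3,  # derivative 3x^2, grows without bound
}

slopes = {name: max_abs_slope(h) for name, h in candidates.items()}
for name, s in slopes.items():
    print(f"{name:8s} max |h'(x)| on [-50, 50] ~ {s:.3f}")
```

Under these assumptions, the first three candidates keep their slope at or below 1 no matter how wide the interval is, while the cubic's slope keeps growing with the interval, which is the kind of function the argument above rules out.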
We observed several functions that have the aforementioned properties. Our experiments show that the network still converges when any of these functions is incorporated, which is not surprising given the argument above. Hence, it is a matter of choosing the right function to obtain the best results for the facial expression task and valence/arousal prediction. Among the functions we investigated, the following showed the most promising results:
[TABLE]
and the corresponding derivative of these functions is as follows:
Figure 3 shows the plots of these three functions and their derivatives. As shown, all of these functions are differentiable at every point and their derivatives are bounded, so the previously mentioned conditions hold for all of them. We call our network Bounded Residual Gradient Network (BReG-Net). Figure 2(b) shows the resulting building block of the BReG-Net module. In our proposed network, similar to ResNet, the dimensions of the tensor are reduced by down-sampling (stride 2) on the first convolution layer of the residual function (Figure 2(c)). As explained in the experiments section, we stack up 39 layers of these blocks in all of our experiments and compare the results on different databases.
III-B Weighted loss
Facial expression databases are usually highly skewed. This form of imbalance is commonly referred to as intrinsic variation, i.e., it is a direct result of the nature of expressions in the real world. This phenomenon exists in both the categorical and dimensional models of affect. For instance, in the AffectNet database, well-represented categories like happiness have almost 30 times as many samples as less represented categories like contempt. The problem of learning from imbalanced data has two downsides. First, training data with an imbalanced distribution often causes learning algorithms to perform poorly on the less represented category [8]. Second, imbalanced test/validation data can affect the performance metrics drastically, resulting in an unrealistic picture of a method’s performance. Jeni et al. [14] studied the influence of skew on imbalanced datasets. This study shows that, except for the area under the ROC curve (AUC), many other evaluation metrics such as accuracy, F1-score, Cohen’s kappa [3], Krippendorff’s alpha [16], and area under the Precision-Recall curve (AUC-PR) are affected dramatically by skewed distributions. In order to minimize skew-biased estimates of performance, the study suggests reporting skew-normalized metrics alongside the original evaluation.
In the result section, we report the skew-normalized metrics of our methods in addition to Matthews Correlation Coefficient (MCC) [18] and Positive Predictive Value (PPV) which is often called precision. Moreover, in order to improve the recognition rate of different categories of emotions in our methods, we assign higher priority to the less represented categories of the databases in the loss calculation layer of our networks. We weigh the loss function for each of the classes by their relative proportion in the training dataset. In other words, the loss function highly penalizes the networks for misclassifying examples from under-represented categories, while it penalizes the networks less for misclassifying examples from well-represented categories. The entropy loss formulation for a training example is defined as:
[TABLE]
where each entry of the penalization matrix gives the penalization factor applied to a class for a given row (true label), the sum runs over all classes, and the predictive softmax takes values in the interval [0, 1], indicating the predicted probability of each class as:
[TABLE]
When the penalization matrix is the identity matrix, the proposed weighted-loss approach reduces to the traditional cross-entropy loss function. In other words, if the training data is completely balanced, the weighted-loss method is equal to the conventional cross-entropy loss function. We implemented this loss function in our TensorFlow model and we define the diagonal penalization matrix as:
[TABLE]
where the diagonal entries are determined by the number of samples in each category and the number of samples in the least-represented category. As mentioned earlier, this causes the loss function to highly penalize the network for misclassifying examples from under-represented categories. In the results section we show that this improves the network's recognition of under-represented categories and yields an overall better recognition rate.
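A class-weighted cross-entropy of this kind can be sketched as follows. The sketch assumes that the diagonal weight of class i is the sample count of the least-represented class divided by the sample count of class i, so that the least-represented class gets weight 1 and larger classes get smaller weights. That specific form, the function names, and the sample counts are illustrative assumptions, not the authors' TensorFlow implementation.

```python
import math

def class_weights(counts):
    """Assumed weighting: n_min / n_i for each class i."""
    n_min = min(counts)
    return [n_min / n for n in counts]

def softmax(logits):
    """Numerically stable softmax; outputs lie in [0, 1] and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    probs = softmax(logits)
    return -weights[label] * math.log(probs[label])

# Hypothetical class counts: class 0 is 30 times larger than class 2.
counts = [3000, 1000, 100]
weights = class_weights(counts)

logits = [0.2, 0.1, -0.3]
loss_major = weighted_cross_entropy(logits, 0, weights)   # true class is well represented
loss_minor = weighted_cross_entropy(logits, 2, weights)   # true class is under-represented
unweighted = weighted_cross_entropy(logits, 2, [1.0, 1.0, 1.0])

print(weights, loss_major, loss_minor, unweighted)
```

With perfectly balanced counts every weight equals 1 and the function reduces to the ordinary cross-entropy, matching the identity-matrix case described above.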
IV EXPERIMENTS AND RESULTS
In this section, we briefly review the face databases used for evaluating our proposed method. We then report the results of our experiments on these databases, evaluated with different metrics on both the categorical and dimensional models of affect.
IV-A Face databases
As noted earlier, many of the traditional facial expression databases are assembled in a controlled environment, and methods developed on them do not yield satisfying results in practice. Therefore, we chose databases that are captured in the wild, which contain a variety of backgrounds, lighting, pose, subject ethnicity, etc. These databases are AffectNet [20], Affect-in-Wild [32], and FER2013 [1]. AffectNet contains labels of both the categorical and dimensional models, Affect-in-Wild contains only labels of the dimensional model, and FER2013 contains only labels of the categorical model. AffectNet contains more than one million facial images collected from the Internet by querying three major search engines using 1250 emotion-related keywords in six different languages. Affect-in-Wild contains 300 videos of different subjects watching videos of various TV shows and movies. FER2013 was created using the Google image search API. Faces are labeled with any of the six basic expressions, along with neutral. The resulting database contains 35,887 images in the wild settings.
IV-B Evaluation metrics of dimensional model
In order to evaluate our methods, we calculate and report Root Mean Square Error (RMSE), Correlation Coefficient (CC), Concordance Correlation Coefficient (CCC), and Sign AGReement (SAGR) metrics for our methods. In the following, we briefly describe the definitions of these metrics.
Root Mean Square Error (RMSE) is the most common evaluation metric in a continuous domain which is defined as:
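$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_i - \theta_i\right)^2}$$
where $\hat{\theta}_i$ and $\theta_i$ are the prediction and the ground-truth of the $i$-th sample, and $n$ is the number of samples. RMSE-based evaluation metrics can heavily weigh the outliers [2], and they do not consider the covariance of the data.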
Pearson’s Correlation Coefficient (CC) overcomes this problem [22] and it is defined as:
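$$\mathrm{CC} = \frac{\mathrm{COV}\{\hat{\theta}, \theta\}}{\sigma_{\hat{\theta}}\,\sigma_{\theta}}$$
where $\mathrm{COV}$ is the covariance function, $\hat{\theta}$ and $\theta$ denote the predictions and the ground-truth, and $\sigma_{\hat{\theta}}$ and $\sigma_{\theta}$ are their standard deviations.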
Concordance Correlation Coefficient (CCC) is another metric [30] and combines CC with the square difference between the means of two compared time series:
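$$\rho_c = \frac{2\rho\,\sigma_{\hat{\theta}}\,\sigma_{\theta}}{\sigma_{\hat{\theta}}^2 + \sigma_{\theta}^2 + \left(\mu_{\hat{\theta}} - \mu_{\theta}\right)^2}$$
where $\rho$ is the Pearson correlation coefficient (CC) between two time series (e.g., prediction $\hat{\theta}$ and ground-truth $\theta$), $\sigma_{\hat{\theta}}^2$ and $\sigma_{\theta}^2$ are the variances of each time series, $\sigma_{\hat{\theta}}$ and $\sigma_{\theta}$ are the standard deviations of each, and $\mu_{\hat{\theta}}$ and $\mu_{\theta}$ are the mean values of each. Unlike CC, predictions that are well correlated with the ground-truth but shifted in value are penalized in the CCC in proportion to the deviation.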
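The difference between these three metrics can be illustrated with a small sketch that uses only the Python standard library. The function names and the toy valence values are assumptions made for the example, and population (not sample) variances are used throughout.

```python
import math

def mean(v):
    return sum(v) / len(v)

def rmse(pred, truth):
    """Root mean square error between predictions and ground truth."""
    return math.sqrt(mean([(p - t) ** 2 for p, t in zip(pred, truth)]))

def cc(pred, truth):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    mp, mt = mean(pred), mean(truth)
    cov = mean([(p - mp) * (t - mt) for p, t in zip(pred, truth)])
    sp = math.sqrt(mean([(p - mp) ** 2 for p in pred]))
    st = math.sqrt(mean([(t - mt) ** 2 for t in truth]))
    return cov / (sp * st)

def ccc(pred, truth):
    """Concordance correlation: CC scaled down by variance and mean mismatch."""
    mp, mt = mean(pred), mean(truth)
    vp = mean([(p - mp) ** 2 for p in pred])
    vt = mean([(t - mt) ** 2 for t in truth])
    rho = cc(pred, truth)
    return 2 * rho * math.sqrt(vp) * math.sqrt(vt) / (vp + vt + (mp - mt) ** 2)

# Toy ground truth and predictions that are perfectly correlated but shifted by 0.3.
truth = [-0.8, -0.4, 0.0, 0.4, 0.8]
shifted = [t + 0.3 for t in truth]

print(rmse(shifted, truth), cc(shifted, truth), ccc(shifted, truth))
```

In this toy case CC stays at 1 because the shift does not change the correlation, whereas RMSE equals the shift and CCC drops below 1, which is the penalization of shifted predictions described above.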
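The values of valence and arousal fall within the interval [-1,+1], and correctly predicting their signs is essential in many emotion-prediction applications. Therefore, we use the Sign AGReement (SAGR) metric as proposed in [22] to evaluate the performance of a valence and arousal prediction system with respect to the sign agreement. SAGR is defined as:
$$\mathrm{SAGR} = \frac{1}{n}\sum_{i=1}^{n}\delta\left(\mathrm{sign}(\hat{\theta}_i), \mathrm{sign}(\theta_i)\right)$$
where $\hat{\theta}_i$ and $\theta_i$ are the prediction and the ground-truth of the $i$-th sample, $n$ is the number of samples, and $\delta$ is the Kronecker delta function, defined as:
$$\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases}$$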
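As a small illustration of sign agreement, the sketch below counts the fraction of samples whose predicted and ground-truth values have the same sign. The helper names and the toy values are assumptions for the example.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def sagr(pred, truth):
    """Fraction of samples whose predicted and true signs agree."""
    matches = sum(1 for p, t in zip(pred, truth) if sign(p) == sign(t))
    return matches / len(truth)

# Toy valence values: the second prediction has the wrong sign.
truth = [0.6, -0.2, 0.1, -0.7]
pred = [0.9, 0.3, 0.05, -0.1]
score = sagr(pred, truth)
print(score)
```

Note that the fourth prediction is far from its target in magnitude yet still counts as correct, since SAGR only measures agreement in sign.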
IV-C Results
Figure 4 shows the general structure of the network. Our experiments show that one of the functions in Equation (4) yields better results than the others in terms of both prediction rate and convergence speed. We also investigated a variety of BReG-Net architectures with shallower and deeper depths. Our experiments indicated that when the network is too shallow, the number of parameters is not enough to distinguish the subtle facial muscle changes. Figure 5 shows the results of different depths in both the categorical and dimensional models of affect while using this function as the residual function in our proposed method. Thus, we propose the architecture in Figure 4 for the two tasks of predicting the categorical and dimensional models of affect. We provide the results of our experiments for each of these tasks separately. All of the proposed methods are implemented using a combination of the TensorFlow [17] and TfLearn [4] toolboxes. We used the Momentum optimization method with a weight decay of 0.0001 and a learning rate of 0.01. Mean square error is used as the loss function in the dimensional model experiments.
IV-C1 Categorical model
Table I shows the results of our experiments with the three functions in Equation (4) as the residual function. We can see that one of the functions has the best result compared to the other two. This was true throughout all of the experiments. Therefore, due to space limitations, all of the results reported from this point on are obtained with this function. Table II shows the results of our experiments in the categorical model of affect on the AffectNet and FER2013 databases. It can be seen that the weighted loss further improves the recognition rates in both databases. However, the weighted loss is data dependent, while our proposed method improves the recognition rate regardless of the distribution of the data. All of the reported numbers are the results of our experiments on the validation sets of these databases only, as their test sets are not publicly available. As can be seen, our proposed modification of the ResNet module achieves better recognition rates compared to ResNet-110, and it also outperforms the existing methods on both the AffectNet and FER2013 databases. We need to mention that [20] uses AlexNet, Wiles et al. [31] achieved 74.4 for AUC, [19] uses an Inception-based method to classify the expressions, and [28] trained deep learning methods combined with SVMs. Our proposed method is considerably shallower than many of the methods proposed in the field.
In order to further investigate the effect of the weighted-loss method, we calculated F1-score, alpha, kappa, MCC, and PPV metrics in both cases of regular loss and weighted-loss. Tables III and IV show the results for these losses, respectively. The skew normalization is performed by random under-sampling of the classes in the test set. This process is repeated 200 times, and the skew-normalized score is the average of the score on multiple trials. It can be seen that in most cases there is an improvement of correlation in the weighted loss case which shows that our weighted loss addition to the network has a positive impact in recognition of different categories. It is important to note that the FER2013 database is an almost balanced database. Therefore, the reported results for original and skew-normalized cases have almost the same value.
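The skew-normalization procedure described above can be sketched as follows: every class in the test set is randomly under-sampled to the size of the smallest class, the metric is computed on that balanced subset, and the scores are averaged over 200 trials. The function names, the fixed seed, and the toy data are assumptions for illustration, and accuracy stands in for the metrics reported in the tables.

```python
import random

def accuracy(labels, preds):
    """Fraction of samples whose prediction matches the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def skew_normalized(labels, preds, metric, trials=200, seed=0):
    """Average `metric` over class-balanced random subsamples of the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(ix) for ix in by_class.values())
    scores = []
    for _ in range(trials):
        chosen = []
        for ix in by_class.values():
            # Under-sample every class to the size of the smallest class.
            chosen.extend(rng.sample(ix, n_min))
        scores.append(metric([labels[i] for i in chosen],
                             [preds[i] for i in chosen]))
    return sum(scores) / len(scores)

# Toy test set: 90 samples of class 0, 10 of class 1,
# scored against a classifier that always predicts class 0.
labels = [0] * 90 + [1] * 10
preds = [0] * 100
orig = accuracy(labels, preds)
norm = skew_normalized(labels, preds, accuracy)
print(orig, norm)
```

On this toy data the original accuracy looks high only because of the skew, while the skew-normalized value exposes that the minority class is never recognized. On an almost balanced test set the two values stay close, as noted above for FER2013.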
IV-C2 Dimensional model
Table V shows the results of our experiments in the dimensional model of affect on the validation sets of the AffectNet and Affect-in-Wild databases (the test set was not released for either database). It is important to point out that [20] uses AlexNet, and [6] uses an Inception-ResNet-based method to classify the expressions. The reported results are RMSE values, as other methods have only provided this metric in their work. Table V shows that our proposed method outperforms the state-of-the-art methods in terms of RMSE for both databases. Our results show significant improvement compared to the methods reported in the AffectNet paper [20]. Also, as in the categorical model experiments, we can see significant improvement using BReG-Net compared to ResNet-110. Figure 6 shows that the loss of our proposed method decreases faster than that of ResNet-110 and eventually reaches a lower value on both the training and validation sets during training.
In order to further investigate the effect of BReG-Net in the dimensional model of affect, we report the results using the metrics of CC, CCC, and SAGR. Tables VI and VII show the values of these metrics for BReG-Net and ResNet-110, respectively. It can be seen that the sign agreement is significantly improved when using BReG-Net, and that the correlation of the predicted values is also higher than for ResNet. Also, we can see that the predicted valence values have lower RMSE while having higher correlation with the ground-truth compared to their corresponding arousal values. This is not surprising, as RMSE and the correlation coefficient measure two different aspects of the distribution of the data. These tables also show that the Affect-in-Wild database is more challenging, as the predicted values have less correlation with the ground-truth ones.
In order to compare the computational cost of BReG-Net and ResNet, we recorded the computation time of training the model for one epoch on the AffectNet database in the categorical model. The average processing time of an epoch on AffectNet is 750.21 seconds for BReG-Net with 4.9M parameters and 836.04 seconds for ResNet-110 with 7.2M parameters on a GeForce GTX 1080 Ti GPU. Therefore, our proposed method trains roughly 10% faster per epoch than ResNet-110, as it has fewer parameters to train.
V CONCLUSION
This paper introduces BReG-Net, a new residual-based network architecture that uses a differentiable function with a bounded gradient instead of a shortcut path between the input and the output of the residual block, for the task of affect estimation in both the categorical and dimensional models of affect. Our experiments showed that employing more complex units results in shallower networks with better performance. We also used a weighted loss function in the categorical model, where our method gives higher priority to the under-represented categories, resulting in a better recognition rate. We evaluated our proposed method on three databases of facial images captured in wild settings. Our experiments showed that the proposed method outperforms state-of-the-art methods in both tasks.
VI Acknowledgement
This paper is based upon work partially supported by the National Science Foundation under Grant No. CNS-1427872. We also thank NVIDIA for donation of a GPU to the University of Denver.
REFERENCES
- [1] Challenges in representation learning: Facial expression recognition challenge. http://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge.
- [2] S. Bermejo and J. Cabestany. Oriented principal component analysis for large margin classifiers. Neural Networks, 14(10):1447–1461, 2001.
- [3] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37, 1960.
- [4] A. Damien et al. Tflearn. https://github.com/tflearn/tflearn, 2016.
- [5] P. Ekman and W. V. Friesen. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124, 1971.
- [6] B. Hasani and M. H. Mahoor. Facial affect estimation in the wild using deep residual and convolutional networks. In CVPR Workshops, pages 1955–1962. IEEE, 2017.
- [7] B. Hasani and M. H. Mahoor. Facial expression recognition using enhanced deep 3D convolutional neural networks. In CVPR Workshops, pages 2278–2288. IEEE, 2017.
- [8] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, 21(9):1263–1284, 2009.
