Human-Centered Emotion Recognition in Animated GIFs
Zhengyuan Yang, Yixuan Zhang, Jiebo Luo

TL;DR
This paper introduces a human-centered approach to GIF emotion recognition using a novel neural network that emphasizes facial features and temporal dynamics, achieving superior performance and interpretability.
Contribution
The study proposes the Keypoint Attended Visual Attention Network (KAVAN), integrating facial attention and hierarchical temporal modeling for improved GIF emotion recognition.
Findings
Outperforms state-of-the-art on MIT GIFGIF dataset
Facial attention enhances recognition accuracy and interpretability
Hierarchical segment modeling captures global GIF features
| Methods | Accuracy |
|---|---|
| ResNet-50 + LSTM | |
| Soft-Att + LSTM | |
| ResNet-50 + HS-LSTM | |
| Soft-Att + HS-LSTM | |
| MTL Soft-Att + HS-LSTM | |
Abstract
As an intuitive way of expressing emotion, animated Graphical Interchange Format (GIF) images have been widely used on social media. Most previous studies on automated GIF emotion recognition fail to effectively utilize GIF’s unique properties, and this potentially limits the recognition performance. In this study, we demonstrate the importance of human-related information in GIFs and conduct human-centered GIF emotion recognition with a proposed Keypoint Attended Visual Attention Network (KAVAN). The framework consists of a facial attention module and a hierarchical segment temporal module. The facial attention module exploits the strong relationship between GIF contents and human characters, and extracts frame-level visual features with a focus on human faces. The Hierarchical Segment LSTM (HS-LSTM) module is then proposed to better learn global GIF representations. Our proposed framework outperforms the state-of-the-art on the MIT GIFGIF dataset. Furthermore, the facial attention module provides reliable facial region mask predictions, which improve the model’s interpretability.
**Index Terms:** Emotion Recognition, Affective Computing, Image Sequence Analysis, Visual Attention
1 Introduction
Animated Graphical Interchange Format (GIF) images have been widely used on social media for online chatting and emotion expression [1, 2]. GIFs are short image sequences and are more lightweight than videos. Because of this, they can be used on social media with lower latency and lower bandwidth requirements. On the other hand, GIFs express emotions better than still images because of the temporal information they contain. By analyzing over 3.9 million posts on Tumblr, Bakhshi et al. [1] show that GIFs are significantly more engaging than other online media types. Because of GIF’s popularity, many previous studies explore automated GIF emotion recognition. Most studies [3, 4] extract visual representations for emotion recognition with pre-defined features or convolutional neural networks. Although previous approaches provide feasible solutions for GIF emotion recognition, they process GIFs as general videos and fail to utilize GIF’s unique properties. We show that this potentially limits the recognition performance and propose human-centered GIF emotion recognition.
Human and human-like characters play an important role in GIFs. A sampling on the GIF search engine GIPHY (https://giphy.com/) shows that a majority of GIFs contain clear human or cartoon faces. A previous study [5] also reveals the importance of human faces in expressing emotions. Motivated by this, we explore human-centered GIF emotion recognition and improve recognition performance by focusing on informative facial regions. To be specific, we design a side task of facial region prediction in the proposed facial attention module, where estimated facial keypoints are used to represent human information and are fused with frame-level visual features.
Combining human keypoints with appearance features has shown its effectiveness in related video analysis tasks [6, 7]. A majority of methods merge keypoints as an extra input modality, and thus require keypoints to be complete and accurate. However, the quality of keypoints often cannot be guaranteed, especially when keypoints are machine-estimated instead of manually labeled. In the facial attention module, we propose to take estimated facial keypoints as the supervision for a facial region prediction side task, and use predicted regions as attention weights to further refine extracted frame-level visual features. As discussed in Section 3.1, the soft attention fusion is naturally robust against keypoint incompleteness. We further include the keypoint estimation confidence scores in the heatmap generation stage, and make KAVAN robust with respect to inaccurate keypoints. In short, the facial regions predicted by the side task refine the visual features by assigning higher weights to informative facial regions. Furthermore, the predicted facial regions improve the method’s interpretability by reliably localizing facial regions.
Another unique property of GIFs is their temporal conciseness. Unlike videos that contain a portion of ‘background frames’ to better depict a complete story, GIFs are more compact and contain few ‘redundant frames’. For example, emotions ‘embarrassment’ and ‘shame’ can only be correctly interpreted when jointly looking at all frames presented in Fig. 1. To better capture the temporal information from different segments of a GIF, we propose a Hierarchical Segment LSTM (HS-LSTM) structure as KAVAN’s temporal module. GIFs are first evenly split into several temporal segments. The coarse local segment representation is then captured by HS-LSTM nodes. Finally, a global GIF representation is learned with segment features from coarse- to fine-grained.
In this study, we propose the Keypoint Attended Visual Attention Network (KAVAN), which improves GIF emotion recognition performance by effectively utilizing GIF’s unique properties. In the facial attention module, we utilize human information by merging estimated facial keypoints. Furthermore, we show that replacing the traditional LSTM layers in KAVAN with the proposed HS-LSTM structure helps better model the temporal evolution in GIFs and further improves the recognition accuracy. Extensive experiments on the GIFGIF dataset demonstrate the effectiveness of our methods.
2 Related Work
GIF Analysis. Bakhshi et al. [1] show that animated GIFs are more engaging than other social media types by studying over 3.9 million posts on Tumblr. Gygli et al. [8] propose to automatically generate animated GIFs from videos with 100K user-generated GIFs and the corresponding video sources. MIT’s GIFGIF platform is frequently used for GIF emotion recognition studies. Jou et al. [3] recognize GIF emotions using color histograms, facial expressions, image-based aesthetics and visual sentiment. Chen et al. [4] adopt 3D ConvNets to further improve the performance. The GIFGIF+ dataset [2] is a larger GIF emotion recognition dataset. At the time of this study, GIFGIF+ had not been released.
Emotion Recognition. Emotion recognition [9, 10] has been an interesting topic for decades. On a large-scale dataset [11], Rao et al. [12] propose multi-level deep representations for emotion recognition. Multi-modal feature fusion [13] has also proven effective. Instead of modeling emotion recognition as a classification task [12, 11], Zhao et al. [13] propose to learn emotion distributions, which alleviates the perception uncertainty problem, i.e., that different people in different contexts may perceive different emotions from the same content. Regressing emotion intensity scores [3] is another effective approach. Han et al. [14] propose a soft prediction framework for the perception uncertainty problem.
3 Methodology
In this section, we introduce the proposed Keypoint Attended Visual Attention Network (KAVAN), which consists of a facial soft attention module and a temporal module. For clarity, the soft attention module is first introduced with a traditional LSTM temporal module in Section 3.1. We then introduce the novel temporal module in KAVAN, namely the Hierarchical Segment LSTM (HS-LSTM) in Section 3.2. Finally, we discuss the training objective and the refined problem setting for GIF emotion recognition.
3.1 Keypoint Attended Visual Attention Network
One unique property of GIFs is the frequent appearance of human and cartoon faces. More than of the GIFs in the MIT GIFGIF dataset contain human faces. Moreover, many in the remaining portion contain cartoon or personified animal characters that also have abundant facial expressions. Previous studies [1, 15] also show a strong relationship between faces and the engagement level of social media contents. Motivated by the importance of human faces in GIFs, we explore human-centered GIF emotion recognition.
We represent human information as estimated facial keypoints, and propose a facial soft attention module in the Keypoint Attended Visual Attention Network (KAVAN) to utilize the information by fusing keypoints with extracted frame-level visual features. A number of video action recognition studies [6, 7] have explored the fusion of keypoints and appearance features. However, previous studies require manually labeled accurate keypoints and might collapse with noisy estimated keypoints. The major challenge is that estimated keypoints can be inaccurate and incomplete, i.e., certain estimates could be wrong or missing because of occlusions or algorithm failures. To address this challenge, the soft attention module in KAVAN fuses the two modalities with an attention mechanism. We first introduce the side task of facial region prediction. The predicted facial masks are then processed as attention masks to refine visual features. The soft attention module helps focus on informative facial regions and thus contributes to GIF emotion recognition.
The proposed KAVAN structure is shown in Fig. 2. Following Temporal Segment Networks (TSN) [16], GIFs are first evenly split into segments and one frame is randomly sampled from each segment as network input. At each time-stamp , a visual feature block is extracted with the backbone network [17], and a facial region mask is predicted. Extracted visual features are then refined by the facial region mask and fed into a temporal module for GIF emotion recognition. The temporal module can be as simple as a single LSTM layer, or another more effective structure as introduced in Section 3.2. For clarity, we first introduce the base KAVAN structure with a single LSTM layer:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
where are the input, forget, output, memory and hidden states. is the channel length of visual feature blocks and is the dimension of all LSTM states. is the visual feature refined by estimated facial mask . A residual link with adjustable weights is included in the facial soft attention module.
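The TSN-style frame sampling described above can be illustrated with a short sketch. The function name `sample_segment_frames` and its arguments are hypothetical and are not taken from the paper; the sketch only assumes that a GIF with `num_frames` frames is split into `num_segments` near-equal contiguous segments and that one frame index is drawn uniformly at random from each.

```python
import random


def sample_segment_frames(num_frames, num_segments, rng=None):
    """Split frame indices [0, num_frames) into near-equal contiguous
    segments and draw one random frame index from each segment.

    Hypothetical helper: the paper does not specify how remainders are
    handled when num_frames is not divisible by num_segments.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for s in range(num_segments):
        start = (s * num_frames) // num_segments
        end = ((s + 1) * num_frames) // num_segments  # exclusive bound
        indices.append(rng.randrange(start, end))
    return indices


# Example: a 40-frame GIF split into 8 segments of 5 frames each.
frame_ids = sample_segment_frames(40, 8, random.Random(0))
```

Each returned index falls inside its own segment, so the sampled frames always cover the whole GIF in temporal order.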
Facial masks are learned with previous hidden state and visual feature . , and are learnable weights:
[TABLE]
[TABLE]
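Since the mask equations did not survive in this copy, the following is only a rough sketch of a generic additive soft attention step, not the exact formulation used in KAVAN. It assumes the mask is a spatial softmax over scores computed from the previous hidden state and the visual feature block, and that the residual link adds a weighted spatial average of the unrefined features. All names (`facial_attention`, `W_v`, `W_h`, `w`, `residual_weight`) and the residual form are assumptions.

```python
import numpy as np


def spatial_softmax(scores):
    """Numerically stable softmax over all spatial positions."""
    e = np.exp(scores - scores.max())
    return e / e.sum()


def facial_attention(V, h_prev, W_v, W_h, w, residual_weight=0.5):
    """Assumed additive soft attention over a flattened feature block.

    V:      (P, C) visual features at P spatial positions
    h_prev: (D,)   previous LSTM hidden state
    W_v:    (C, A), W_h: (D, A), w: (A,)  learnable weights (assumed shapes)
    """
    scores = np.tanh(V @ W_v + h_prev @ W_h) @ w   # (P,)
    mask = spatial_softmax(scores)                 # attention mask, sums to 1
    attended = mask @ V                            # (C,) mask-weighted feature
    # Assumed residual link: add a weighted plain spatial average.
    refined = attended + residual_weight * V.mean(axis=0)
    return mask, refined


rng = np.random.default_rng(0)
P, C, D, A = 49, 16, 8, 12
mask, refined = facial_attention(
    rng.normal(size=(P, C)), rng.normal(size=D),
    rng.normal(size=(C, A)), rng.normal(size=(D, A)), rng.normal(size=A),
)
```

Because the mask is a normalized weighting rather than an extra input channel, positions without keypoint evidence simply receive lower weight instead of corrupting the input.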
Different from previous self-attention studies [18], facial attention masks are learned with facial keypoint heatmap supervision and an L2 loss, as shown in Eq. 7, which gives the learned attention masks the clear semantic meaning of facial regions.
[TABLE]
The heatmap is converted from estimated keypoints:
[TABLE]
where each keypoint is converted into a 2D Gaussian distribution centered at the keypoint. Keypoint estimation confidences provided by keypoint estimation algorithms are also included in heatmap generation to adjust the weights for Gaussian peaks. Finally, the overlay of keypoint heatmaps is normalized spatially with the softmax function.
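The heatmap generation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian width `sigma`, the heatmap `size`, and the function names are placeholders, since the exact values and the original equation are not available in this copy. The sketch follows the described order: one confidence-weighted Gaussian per keypoint, an overlay by summation, then a spatial softmax. An L2 loss against a predicted mask is included for completeness.

```python
import numpy as np


def keypoints_to_heatmap(keypoints, confidences, size, sigma=1.0):
    """Build a (size, size) supervision heatmap from estimated keypoints.

    keypoints:   (N, 2) array of (x, y) in heatmap pixel coordinates
    confidences: (N,)   estimation confidence used as Gaussian peak weight
    sigma, size: assumed placeholder values, not the paper's settings
    """
    ys, xs = np.mgrid[0:size, 0:size]
    overlay = np.zeros((size, size), dtype=float)
    for (x, y), c in zip(keypoints, confidences):
        gauss = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        overlay += c * gauss                     # low confidence -> low peak
    flat = overlay.ravel()
    flat = np.exp(flat - flat.max())             # spatial softmax
    return (flat / flat.sum()).reshape(size, size)


def mask_l2_loss(pred_mask, heatmap):
    """Sum of squared differences between predicted mask and heatmap."""
    return float(np.sum((pred_mask - heatmap) ** 2))


heatmap = keypoints_to_heatmap(np.array([[3.0, 4.0]]), np.array([0.9]), size=8)
```

In this sketch, a frame with no detected keypoints yields an all-zero overlay, which the softmax turns into a uniform target rather than an undefined one.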
The proposed soft attention fusion method is naturally robust against incomplete keypoints. Moreover, inaccurate predictions with low confidence scores are suppressed by the low keypoint estimation confidence included in the heatmap generation step. Therefore, the proposed approach is robust against both incorrect and incomplete estimated keypoints. Furthermore, it improves the method’s interpretability by reflecting attended regions. Correct facial masks can even be generated on cartoon GIFs where no estimated facial keypoints are available. The entire framework is trained end-to-end with the intermediate keypoint supervision loss added to the main emotion recognition loss in Eq. 12:
[TABLE]
3.2 Hierarchical Segment LSTM Network (HS-LSTM)
In Section 3.1, we introduce the base KAVAN with a single LSTM layer. Naive temporal networks tend to forget information in early stages [19]. This is not desired in GIF emotion recognition, because GIFs contain fewer redundant frames and all frames are indispensable for correct emotion recognition. Inspired by recent studies [20], we propose a Hierarchical Segment LSTM module (HS-LSTM) to better model long-term temporal dependencies.
Instead of learning global representations sequentially with LSTM layers, HS-LSTM first generates segment-level representations for each segment in GIFs. The segment-level representations are then propagated through different tiers for global GIF-level representations. As shown in Fig. 3, HS-LSTM contains several tiers of LSTM layers that learn representations from coarse- to fine-grained. The first tier takes the stacked features in a segment to learn a coarse segment representation. Nodes in the next tier take the corresponding frame features and the coarse representations learned in the previous tier as input, and learn a refined representation. The representations learned at different temporal resolutions are then propagated through the HS-LSTM network for a final GIF representation. The number of tiers, HS-LSTM nodes and input frames can be adjusted flexibly based on data statistics.
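One possible reading of the two-tier case is sketched below with a small NumPy LSTM cell and random, untrained weights. The wiring is an assumption rather than the authors' implementation: tier 1 receives the concatenated frame features of each segment as a single input, and tier 2 receives each frame feature concatenated with the coarse representation of its segment. The names `LSTMCell`, `run_lstm` and `hs_lstm_two_tier` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMCell:
    """Plain LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.W = rng.normal(scale=0.1, size=(input_dim + hidden_dim, 4 * hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def step(self, x, h, c):
        z = np.concatenate([x, h]) @ self.W + self.b
        i, f, o, g = np.split(z, 4)               # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c


def run_lstm(cell, inputs):
    """Run the cell over a sequence and return every hidden state."""
    h = np.zeros(cell.hidden_dim)
    c = np.zeros(cell.hidden_dim)
    outputs = []
    for x in inputs:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs


def hs_lstm_two_tier(frames, num_segments, hidden_dim, rng):
    """frames: (T, C) frame features; T must be divisible by num_segments."""
    T, C = frames.shape
    if T % num_segments != 0:
        raise ValueError("T must be divisible by num_segments in this sketch")
    seg_len = T // num_segments
    # Tier 1 (coarse): one step per segment on the stacked frame features.
    stacked = frames.reshape(num_segments, seg_len * C)
    coarse = run_lstm(LSTMCell(seg_len * C, hidden_dim, rng), stacked)
    # Tier 2 (fine): one step per frame, conditioned on its segment's coarse state.
    fine_inputs = [np.concatenate([frames[t], coarse[t // seg_len]]) for t in range(T)]
    fine = run_lstm(LSTMCell(C + hidden_dim, hidden_dim, rng), fine_inputs)
    return fine[-1]                               # global GIF representation


rng = np.random.default_rng(0)
gif_repr = hs_lstm_two_tier(rng.normal(size=(8, 6)), num_segments=2, hidden_dim=5, rng=rng)
```

The example uses eight frames in two segments of four, matching the two-tier configuration with two nodes of size four described in Section 4.2; the feature and hidden sizes are arbitrary.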
Finally, we show the complete KAVAN structure with HS-LSTM module integrated. The keypoint attended visual attention is only conducted in the last tier:
[TABLE]
where is the output segment representation in the same segment at all previous tiers . The input visual feature to the last tier is weight-averaged by the generated attention mask. The inputs to all other tiers remain unchanged.
3.3 Problem Formulation
In this section, we introduce the problem formulation and training objective for GIF emotion recognition. The base task is modeled as the regression of emotion intensities on all labeled emotion classes. Normalized mean squared error is used for regression, which can avoid over- or under-prediction [3] compared to the MSE loss. The normalized mean squared error (nMSE) is defined as the mean squared error divided by the variance of the target vector.
Although intensity score regression is a good formulation for emotion recognition, it becomes increasingly challenging when the number of emotion classes increases. To alleviate this problem and meanwhile achieve a reliable understanding of coarse GIF emotion categories, we divide all labeled emotions into four coarse categories based on the circumplex affect model [21, 22]. The circumplex affect model proposes that emotions are distributed in a 2D circular space, where the vertical axis represents ‘arousal’ and the horizontal axis represents ‘valence’. With the two axes, we divide the emotions into four categories as shown in Fig. 4. We conduct a four-class classification with cross entropy loss alongside the main regression task. Introducing the categorical emotion classification task has two advantages. First, predicted coarse emotion labels provide extra prior knowledge to the regression branch and make regression easier. Second, a reliable classification branch guarantees correct understanding of the coarse emotion type. For example, confusing ‘Happiness’ with ‘Pleasure’ is a smaller error compared to interpreting ‘Happiness’ as a negative emotion.
Finally, we include a ranking loss to preserve the rank from the strongest emotion to the most unlikely one. We show that predicting the ranking order of emotion intensity scores could also help the regression task. The proposed ranking loss consists of the sum of pairwise ranking losses designed to penalize incorrect orders:
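The definition above translates directly into code. The sketch below assumes the population variance of the target vector is used; the function name `nmse` is a placeholder.

```python
import numpy as np


def nmse(pred, target):
    """Mean squared error divided by the variance of the target vector."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    var = target.var()                      # population variance (assumption)
    if var == 0:
        raise ValueError("target vector has zero variance")
    return float(np.mean((pred - target) ** 2) / var)


target = [0.1, 0.5, 0.9]
perfect = nmse(target, target)              # exact prediction
constant = nmse([0.5, 0.5, 0.5], target)    # always predicting the target mean
```

Under this definition, always predicting the mean of the target gives an nMSE of 1, so values below 1 indicate a model that does better than that constant baseline.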
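As a small illustration of the coarse categorization, the sketch below maps a point in a valence-arousal space centered at zero to one of four quadrants and computes a four-class cross entropy. The quadrant labels are descriptive placeholders, not the category names used in Fig. 4, and no coordinates for the 17 emotions are assumed.

```python
import math


def coarse_category(valence, arousal):
    """Map a centered (valence, arousal) point to one of four quadrants."""
    v = "positive" if valence >= 0 else "negative"
    a = "high" if arousal >= 0 else "low"
    return f"{v} valence / {a} arousal"


def cross_entropy(logits, true_index):
    """Cross entropy of softmax(logits) against a single true class."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[true_index]


label = coarse_category(-0.6, 0.7)                 # illustrative coordinates only
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)      # uniform prediction over 4 classes
```

A uniform prediction over four classes gives a loss of log 4, which is a convenient reference point when judging the classification branch.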
[TABLE]
where is the emotion intensity and is the total number of emotions. The final loss for emotion recognition is:
[TABLE]
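Because the loss equations are not recoverable from this copy, the following is only one common form of a pairwise ranking loss, shown as a sketch: a hinge penalty on every emotion pair whose predicted order contradicts the ground-truth order. The margin, the hinge form, and the loss weights `w_cls` and `w_rank` are assumptions, not values from the paper.

```python
import numpy as np


def pairwise_ranking_loss(pred, target, margin=0.0):
    """Sum of hinge penalties over pairs (i, j) with target[i] > target[j].

    A pair is penalised when pred[i] - pred[j] falls below the margin.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    diff_target = target[:, None] - target[None, :]
    diff_pred = pred[:, None] - pred[None, :]
    hinge = np.maximum(0.0, margin - diff_pred)
    return float(np.sum(hinge[diff_target > 0]))


def total_loss(l_reg, l_cls, l_rank, w_cls=1.0, w_rank=1.0):
    """Weighted sum of regression, classification and ranking losses."""
    return l_reg + w_cls * l_cls + w_rank * l_rank


target = [0.9, 0.5, 0.1]
in_order = pairwise_ranking_loss([0.8, 0.4, 0.2], target)   # order preserved
reversed_order = pairwise_ranking_loss([0.1, 0.5, 0.9], target)
```

With a zero margin, predictions that preserve the target order incur no penalty regardless of their absolute values, which is why the ranking term complements rather than replaces the regression loss.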
4 Experiments
We first introduce the GIFGIF dataset and facial keypoints pre-processing methods. The proposed framework is then evaluated with both classification and regression metrics.
4.1 Experiment Settings
The data used in this study are collected from a website built by the MIT Media Lab named GIFGIF, and are referred to as ‘the GIFGIF dataset’. Extending the previous definition of eight emotions, 17 emotions, as shown in Fig. 4, are labeled to study emotions in more detail. The dataset is labeled by distributed online users. Each annotator is presented with a pair of GIFs and is asked whether GIF A, B or neither expresses a specific emotion. At the time of our data collection, we collected 6,119 GIFs with more than 3.2 million user votes. The massive user votes are converted to a 17-dimensional soft emotion intensity score with the TrueSkill algorithm [23]. Each output emotion intensity score ranges in , which is then linearly normalized into .
Besides the appearance feature, estimated facial keypoints are integrated for GIF emotion recognition. 70 facial keypoints are estimated with OpenPose [24]. We then convert the 70 keypoints in each frame into heatmaps following Eq. 8. Each keypoint corresponds to a Gaussian distribution with . The weight of each Gaussian is adjusted by the prediction confidence of the keypoints that . The keypoints around the lips are denser than in other facial regions according to the 70-point facial keypoint definition [24]. Therefore, the weights for the Gaussian distributions around the lips are further reduced by . The initial heatmap resolution is and is later converted to after overlaying all Gaussian peaks. We randomly split of data for training and the rest for testing. The averaged performance on five random splits is reported. The processed data will be released at https://github.com/zyang-ur/human-centered-GIF.
4.2 Categorical Emotion Classification
We first evaluate the proposed modules with the coarse emotion category classification task. The emotion categories are generated based on the most significant emotion in a GIF. The number of GIFs in each category is . As shown in Table 1, we start with a baseline that uses the ResNet-50 + LSTM structure and only the regression loss . A baseline accuracy of is achieved. We then evaluate the effectiveness of the proposed soft attention module and HS-LSTM module separately, which are referred to as Soft-Att + LSTM and ResNet-50 + HS-LSTM. The soft attention module learns a keypoint-guided attention mask defined in Eq. 5. The dimension of is . and has a parameter size of and , which are both in this study. is the hidden size of the LSTM and is the channel number of visual features. With purely the soft attention module, the recognition accuracy improves from to . In the HS-LSTM module experiment, we adopt a two-tier structure with two HS-LSTM nodes of size four as shown in Fig. 3. With purely the HS-LSTM module, the accuracy improves from to . When combining the soft attention module with HS-LSTM, the KAVAN framework achieves an accuracy of , which is better than both separate modules. Furthermore, by incorporating multi-task learning with the loss proposed in Eq. 9, an extra improvement is obtained and the accuracy reaches . This demonstrates the effectiveness of the proposed MTL setting on the classification task.
4.3 Multi-task Emotion Regression
We then show the effectiveness of the proposed modules and the multi-task learning setting with regression metrics. As shown in Table 2, the baseline model ResNet-50 + LSTM achieves an nMSE of . With the same parameters as in Section 4.2, the soft attention module Soft-Att + LSTM achieves an nMSE of , which is significantly better than the baseline. The HS-LSTM module ResNet-50 + HS-LSTM alone also outperforms the baseline LSTM by reaching an nMSE of . Finally, we evaluate the full framework with both the soft attention module and HS-LSTM adopted. The method reaches an nMSE of .
Furthermore, we fuse the regression framework with the classification branch by conducting multi-task learning. A weighted sum of the nMSE loss, the CrossEntropy loss and the ranking loss is adopted to train the framework. An extra improvement is obtained and the nMSE reaches .
We then compare our results to other state-of-the-art methods on MIT GIFGIF. Because the GIFGIF dataset keeps growing, the version we collect with 6,119 GIFs is larger than the one in the previous study [3] with 3,858 GIFs. Therefore, a direct comparison is unfair as the task becomes more challenging with more ambiguous GIFs included. Based on our re-implementation as shown in Table 2, the Face Expression + Ordinary Least Squares Regression approach [3] works the best and achieves an nMSE of . OpenCV’s Haar feature-based cascade classifiers are used for face detection. CNN+SVM facial expression features [5, 3] pretrained on a facial emotion dataset [5] are extracted on the largest detected face. Our proposed KAVAN framework achieves an nMSE of , which is better than the best re-implemented state-of-the-art of .
4.4 Qualitative Results
As shown in Fig. 5, good qualitative results are observed. For example, the upper-left GIF in Fig. 5 belongs to the category ‘Misery Arousal’, which is represented in blue, and is predicted correctly. Fittingly, the predicted emotion intensities of ‘anger’, ‘fear’ and ‘surprise’ are the highest.
Furthermore, good results on facial region estimation are also observed, as shown in Fig. 6. The larger unmasked image on the left of each sub-figure is the first sampled input frame, and the remaining eight smaller images are the overlay of input frames and facial keypoint heatmaps. The upper two sub-figures in Fig. 6 visualize the supervision heatmaps generated with estimated facial keypoints, which may be inaccurate or incomplete. As shown in the lower two sub-figures in Fig. 6, the attention masks predicted by KAVAN accurately focus on correct facial regions even when no original keypoint annotations are available, such as in cartoon GIFs. Experiments show that the proposed approach makes good use of the keypoint information and is robust against missing or inaccurate annotations. Furthermore, the predicted facial region masks improve the framework’s interpretability.
5 Conclusion
Motivated by GIF’s unique properties, we focus on human-centered GIF emotion recognition and propose a Keypoint Attended Visual Attention Network (KAVAN). In the facial attention module, we learn facial region masks with estimated facial keypoints to guide the GIF frame representation extraction. In the temporal module, we propose a novel Hierarchical Segment LSTM (HS-LSTM) structure to better represent the temporal evolution and learn better global representations. Experiments on the GIFGIF dataset validate the effectiveness of the proposed framework.
Acknowledgement. This work is partially supported by NSF awards #1704309, #1722847, and #1813709.
References
- [1] Saeideh Bakhshi, David A Shamma, Lyndon Kennedy, Yale Song, Paloma de Juan, and Joseph ‘Jofish’ Kaye, “Fast, cheap, and good: Why animated gifs engage us,” in CHI. ACM, 2016, pp. 575–586.
- [2] Weixuan Chen, Ognjen Oggi Rudovic, and Rosalind W Picard, “Gifgif+: Collecting emotional animated gifs with clustered multi-task learning,” in ACII. IEEE, 2017, pp. 410–417.
- [3] Brendan Jou, Subhabrata Bhattacharya, and Shih-Fu Chang, “Predicting viewer perceived emotions in animated gifs,” in ACM MM. ACM, 2014, pp. 213–216.
- [4] Weixuan Chen and Rosalind W Picard, “Predicting perceived emotions in animated gifs with 3d convolutional neural networks,” in ISM. IEEE, 2016, pp. 367–368.
- [5] Yichuan Tang, “Deep learning using linear support vector machines,” arXiv preprint arXiv:1306.0239, 2013.
- [6] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black, “Towards understanding action recognition,” in ICCV. IEEE, 2013, pp. 3192–3199.
- [7] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid, “P-cnn: Pose-based cnn features for action recognition,” in ICCV, 2015, pp. 3218–3226.
- [8] Michael Gygli, Yale Song, and Liangliang Cao, “Video2gif: Automatic generation of animated gifs from video,” in CVPR, 2016.
