A Comparison of Time-based Models for Multimodal Emotion Recognition
Ege Kesim, Selahattin Serdar Helli, Sena Nur Cavsak

TL;DR
This paper compares GRU, LSTM, Transformer, and Max Pooling as sequence models for multimodal emotion recognition from sound and video data, and reports which architecture performs best on each metric.
Contribution
It provides a comparative analysis of sequence models for multimodal emotion recognition, demonstrating their relative performance on the CREMA-D dataset.
Findings
GRU achieved the highest F1 score of 0.640.
LSTM achieved the highest precision of 0.699.
Max Pooling showed the best sensitivity with 0.620.
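The precision, sensitivity (recall), and F1 figures above can all be derived from predicted and true labels. The following is a minimal sketch, assuming macro averaging over emotion classes; the summary does not say which averaging scheme the authors used, and the function names and toy labels here are hypothetical.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} for each class in `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted p, but it was not p
            fn[t] += 1          # missed an instance of class t
    out = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[c] = (precision, recall, f1)
    return out


def macro_average(metrics):
    """Unweighted mean of (precision, recall, f1) across classes (assumed scheme)."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy example with hypothetical labels, not data from the paper.
toy_true = ["anger", "anger", "happy", "sad"]
toy_pred = ["anger", "happy", "happy", "sad"]
toy_metrics = per_class_metrics(toy_true, toy_pred, ["anger", "happy", "sad"])
toy_macro = macro_average(toy_metrics)
```

Sensitivity is the same quantity as recall, so the second element of each tuple corresponds to the sensitivity metric reported above.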
Abstract
Emotion recognition has become an important research topic in the field of human-computer interaction. Studies that use sound and video to understand emotions have focused mainly on analyzing facial expressions and classifying six basic emotions. In this study, the performance of different sequence models in multimodal emotion recognition was compared. The sound and images were first processed by multi-layered CNN models, and the outputs of these models were fed into various sequence models. The sequence models compared were GRU, Transformer, LSTM, and Max Pooling. Accuracy, precision, and F1 score values of all models were calculated. The multimodal CREMA-D dataset was used in the experiments. On the CREMA-D dataset, the GRU-based architecture achieved the best F1 score (0.640) and the LSTM-based architecture the best precision (0.699), while sensitivity showed the best…
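Of the four sequence models named in the abstract, Max Pooling is the simplest: it has no learned recurrence and collapses the per-time-step CNN outputs into one fixed-length vector. The sketch below illustrates generic element-wise max pooling over time in plain Python. It is an assumption about what the Max Pooling variant does in spirit; the abstract does not give feature dimensions, fusion details, or the exact pooling configuration, and the function name and toy values are hypothetical.

```python
def max_pool_over_time(frames):
    """Collapse a sequence of equal-length feature vectors into one vector
    by taking the element-wise maximum across time steps."""
    if not frames:
        raise ValueError("need at least one time step")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all feature vectors must have the same length")
    # For each feature dimension, keep the largest value seen at any time step.
    return [max(f[d] for f in frames) for d in range(dim)]


# Three time steps of 3-dimensional features (toy values, not from the paper).
toy_frames = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, -0.1],
    [0.3, 0.5, -0.7],
]
pooled = max_pool_over_time(toy_frames)  # one vector summarizing the sequence
```

Unlike GRU or LSTM, this aggregation ignores the order of the time steps, which is one reason it serves as a useful point of comparison against recurrent and attention-based models.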
Taxonomy
Topics: Emotion and Mood Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Tanh Activation · Layer Normalization · Label Smoothing · Adam · Byte Pair Encoding
