Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention
Lev Evtodienko

TL;DR
This paper introduces an end-to-end multimodal model for group emotion recognition that fine-tunes all layers for improved accuracy, outperforming previous methods by approximately 8.5% on the VGAF dataset.
Contribution
The work presents a novel end-to-end training approach for multimodal emotion recognition, enabling joint optimization of feature extraction and fusion layers.
Findings
Achieved 60.37% validation accuracy on the VGAF dataset, outperforming the baseline by approximately 8.5%.
Fine-tuning all layers improves model performance.
Effective integration of audio and visual modalities for emotion recognition.
Abstract
Classifying group-level emotions is a challenging task due to the complexity of video, in which not only visual but also audio information should be taken into consideration. Existing works on multimodal emotion recognition use a bulky approach, in which pretrained neural networks serve as feature extractors and the extracted features are then fused. However, this approach does not consider the attributes of multimodal data, and the feature extractors cannot be fine-tuned for the specific task, which can be disadvantageous for overall model accuracy. To this end, our contribution is twofold: (i) we train the model end-to-end, which allows the early layers of the neural network to adapt while taking into account the later fusion layers of the two modalities; (ii) all layers of our model were fine-tuned for the downstream task of emotion recognition, so there was no need to train the neural networks from scratch. Our model…
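The abstract does not spell out the fusion layers, so the following is only a minimal sketch of generic scaled dot-product cross-modal attention, in which video features attend to audio features. It is not the author's architecture. The function name `cross_modal_attention`, the feature dimensions, and the random projection matrices are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Let one modality (queries) attend to another (keys and values).

    Hypothetical helper for illustration; shapes:
      query_feats:   (Tq, Dq), e.g. per-frame video embeddings
      context_feats: (Tc, Dc), e.g. per-window audio embeddings
    """
    q = query_feats @ w_q                     # (Tq, d)
    k = context_feats @ w_k                   # (Tc, d)
    v = context_feats @ w_v                   # (Tc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (Tq, Tc) similarity scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # fused features: (Tq, d)

# Assumed toy inputs: 8 video frames (dim 32) and 20 audio windows (dim 16).
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 32))
audio = rng.normal(size=(20, 16))

d = 24  # assumed shared attention dimension
w_q = rng.normal(size=(32, d)) * 0.1
w_k = rng.normal(size=(16, d)) * 0.1
w_v = rng.normal(size=(16, d)) * 0.1

fused, weights = cross_modal_attention(video, audio, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In an end-to-end setup of the kind the abstract describes, the projections above and the encoders producing `video` and `audio` would all be trainable, so the classification loss could update the feature extractors as well as the fusion layers.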
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Music and Audio Processing
