Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
R. Gnana Praveen, Eric Granger, Patrick Cardinal

TL;DR
This paper proposes a novel cross-attentional audio-visual fusion model for dimensional emotion recognition that effectively captures inter-modal relationships, outperforming existing methods on benchmark datasets.
Contribution
It introduces a cross-attention mechanism for audio-visual fusion, enhancing the extraction of salient features for continuous emotion prediction.
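The idea of cross-attention between two modalities can be illustrated with a minimal, generic sketch. It is not the authors' exact formulation: the function name `cross_attention_fuse`, the weight matrix `w`, the toy feature values, and the choice to concatenate the two attended contexts are all assumptions made for illustration. In a real model, `w` would be learned and the features would come from audio and visual backbones.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention_fuse(audio, visual, w):
    """Hypothetical fusion: audio is (T, d_a), visual is (T, d_v), w is (d_a, d_v)."""
    # Cross-modal correlation between every audio step and every visual step: (T, T).
    corr = matmul(matmul(audio, w), transpose(visual))
    # Each audio step attends over visual steps, and each visual step over audio steps.
    audio_to_visual = softmax_rows(corr)
    visual_to_audio = softmax_rows(transpose(corr))
    # Attention-weighted summaries of the *other* modality.
    visual_context = matmul(audio_to_visual, visual)  # (T, d_v)
    audio_context = matmul(visual_to_audio, audio)    # (T, d_a)
    # Assumed fusion step: concatenate both contexts per time step.
    fused = [a + v for a, v in zip(audio_context, visual_context)]
    return fused, audio_to_visual, visual_to_audio

# Toy inputs (made-up values): T=3 time steps, d_a=2, d_v=3.
audio = [[0.1, 0.4], [0.3, -0.2], [0.5, 0.0]]
visual = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [0.0, 1.0, 0.4]]
w = [[0.5, -0.1, 0.2], [0.0, 0.3, 0.1]]

fused, a2v, v2a = cross_attention_fuse(audio, visual, w)
```

Here `fused` holds one vector of length d_a + d_v per time step, which a downstream regressor could map to continuous emotion values; the attention matrices `a2v` and `v2a` show how strongly each step of one modality weights the steps of the other.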
Findings
Outperforms state-of-the-art fusion methods on the RECOLA dataset.
Demonstrates effectiveness on a private fatigue dataset.
Provides a cost-effective and accurate multimodal emotion recognition approach.
Abstract
Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated uni-modal approaches. The most effective techniques for multimodal emotion recognition efficiently leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most of the existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach to extract the salient features across A-V…
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing
