Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition
G Rajasekhar, Jahangir Alam

TL;DR
This paper introduces Inconsistency-Aware Cross-Attention (IACA), a cross-attention method for audio-visual emotion recognition that adaptively selects features depending on whether the audio and visual modalities are strongly or weakly complementary, so that weak cross-modal relationships do not degrade the fused representation.
Contribution
The paper proposes a two-stage gating mechanism within IACA to dynamically select relevant features based on the strength of cross-modal relationships, enhancing multimodal emotion recognition.
Findings
IACA outperforms existing methods on the Aff-Wild2 dataset.
The model demonstrates robustness in handling weak and strong modality relationships.
Extensive experiments validate the effectiveness of the adaptive feature selection approach.
Abstract
Leveraging complementary relationships across modalities has recently drawn a lot of attention in multimodal emotion recognition. Most of the existing approaches explored cross-attention to capture the complementary relationships across the modalities. However, the modalities may also exhibit weak complementary relationships, which may deteriorate the cross-attended features, resulting in poor multimodal feature representations. To address this problem, we propose Inconsistency-Aware Cross-Attention (IACA), which can adaptively select the most relevant features on-the-fly based on the strong or weak complementary relationships across audio and visual modalities. Specifically, we design a two-stage gating mechanism that can adaptively select the appropriate relevant features to deal with weak complementary relationships. Extensive experiments are conducted on the challenging Aff-Wild2…
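The gating idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how the two stages are wired, so the layout here is an assumption (stage 1 softly chooses between each modality's original and cross-attended features; stage 2 softly chooses between the stage-1 output and the cross-attended features). The function names, the `params` dictionary, and the random weights are all hypothetical and stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context):
    # Scaled dot-product cross-attention: each query step attends over context steps.
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)      # (T, T)
    return softmax(scores, axis=-1) @ context    # (T, d)

def soft_gate(a, b, w):
    # Hypothetical gate: per-time-step weights over the two candidates a and b.
    logits = np.concatenate([a, b], axis=-1) @ w  # (T, 2)
    g = softmax(logits, axis=-1)                  # each row sums to 1
    return g[:, :1] * a + g[:, 1:] * b, g

def two_stage_gated_fusion(audio, visual, params):
    # Cross-attended features for each modality.
    att_a = cross_attend(audio, visual)
    att_v = cross_attend(visual, audio)
    # Stage 1 (assumed): original vs. cross-attended features.
    s1_a, _ = soft_gate(audio, att_a, params["a1"])
    s1_v, _ = soft_gate(visual, att_v, params["v1"])
    # Stage 2 (assumed): stage-1 output vs. cross-attended features.
    s2_a, g_a = soft_gate(s1_a, att_a, params["a2"])
    s2_v, g_v = soft_gate(s1_v, att_v, params["v2"])
    # Concatenate the two gated streams into one fused representation.
    return np.concatenate([s2_a, s2_v], axis=-1), (g_a, g_v)

rng = np.random.default_rng(0)
T, d = 8, 16  # illustrative sequence length and feature size
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))
# Random stand-ins for learned gate weights (assumption for the demo).
params = {k: 0.1 * rng.standard_normal((2 * d, 2)) for k in ("a1", "v1", "a2", "v2")}

fused, (g_a, g_v) = two_stage_gated_fusion(audio, visual, params)
print(fused.shape)  # (8, 32)
```

When a gate puts most of its weight on the original features, the cross-attended features for that time step are effectively bypassed, which is the behaviour one would want when the two modalities are only weakly complementary.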
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing
