Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling
Junhyeong Byeon, Jeongyeol Kim, Sejoon Lim

TL;DR
This paper introduces a multimodal emotion recognition framework that combines visual, audio, and text modalities with temporal modeling and bi-directional cross-attention to improve accuracy in real-world video data.
Contribution
It presents a novel multimodal architecture with bi-directional cross-attention and temporal visual modeling, enhanced by a text-guided contrastive objective, for improved emotion recognition.
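To make the fusion idea concrete, here is a minimal NumPy sketch of bi-directional cross-attention between two modality sequences. It is an illustration under stated assumptions, not the authors' implementation: it uses a single head with no learned projections, adds a residual connection, and mean-pools each stream before concatenation. The function names, feature sizes, and pooling choice are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head scaled dot-product attention: `queries` attend to `context`.

    queries: (Tq, d), context: (Tc, d). Learned Q/K/V projections are
    omitted here for brevity (an assumption of this sketch).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (Tq, Tc)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights           # (Tq, d), (Tq, Tc)

def bidirectional_cross_attention(visual, audio):
    """Run attention in both directions and fuse the two streams."""
    v_attended, _ = cross_attend(visual, audio)  # visual queries, audio context
    a_attended, _ = cross_attend(audio, visual)  # audio queries, visual context
    v_out = visual + v_attended                  # residual connection (assumed)
    a_out = audio + a_attended
    # Mean-pool over time and concatenate (assumed pooling strategy).
    return np.concatenate([v_out.mean(axis=0), a_out.mean(axis=0)])

rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 8))   # 5 video frames, 8-dim features (toy sizes)
audio = rng.normal(size=(7, 8))    # 7 audio frames, 8-dim features (toy sizes)
fused = bidirectional_cross_attention(visual, audio)
print(fused.shape)
```

The two sequences may have different lengths, since each direction produces one output per query step. A trained model would typically add learned projections, multiple heads, and normalization around this core.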
Findings
Achieved a Macro F1 score of 0.32 on the ABAW 10th EXPR benchmark.
Demonstrated the effectiveness of multimodal fusion and temporal modeling.
Outperformed baseline methods significantly.
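For readers unfamiliar with the Macro F1 metric reported above, the following pure-Python sketch shows how it is computed: F1 is calculated per class and then averaged with equal weight, so rare classes count as much as frequent ones. The labels and the three-class setup are toy values chosen for illustration, not data from the paper.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy example with three hypothetical classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, num_classes=3), 4))  # 0.6556
```

Here the per-class F1 values are 0.5, 0.8, and 2/3, which average to about 0.6556.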
Abstract
Expression recognition in in-the-wild video data remains challenging due to substantial variations in facial appearance, background conditions, audio noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient for capturing these complex emotional cues. To address this limitation, we propose a multimodal emotion recognition framework for the Expression (EXPR) task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our framework builds on large-scale pre-trained models for visual and audio representation learning and integrates them in a unified multimodal architecture. To better capture temporal patterns in facial expression sequences, we incorporate temporal visual modeling over video windows. We further introduce a bi-directional cross-attention fusion module that enables visual…
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Human Pose and Action Recognition
