Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

Junhyeong Byeon; Jeongyeol Kim; Sejoon Lim

arXiv:2603.11971 · cs.CV · March 19, 2026

TL;DR

This paper introduces a multimodal emotion recognition framework that combines visual, audio, and text modalities with temporal modeling and bi-directional cross-attention to improve accuracy in real-world video data.

Contribution

The paper presents a novel multimodal architecture that combines bi-directional cross-attention with temporal visual modeling and adds a text-guided contrastive objective to improve emotion recognition.
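As a rough illustration of what bi-directional cross-attention between two modality streams can look like, here is a minimal NumPy sketch. It is not the authors' implementation: the choice of visual and audio as the two streams, the single attention head, the square projection matrices, the residual connections, and every name in the code (`cross_attend`, `bidirectional_cross_attention`, the `params` keys) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    `queries` (T_q, d) attends over `context` (T_c, d).
    Returns the attended features (T_q, d) and the attention weights (T_q, T_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def bidirectional_cross_attention(visual, audio, params):
    """Hypothetical fusion step: each stream queries the other one.

    Residual connections are an assumption of this sketch.
    """
    vis_from_aud, w_va = cross_attend(visual, audio, *params["visual_queries_audio"])
    aud_from_vis, w_av = cross_attend(audio, visual, *params["audio_queries_visual"])
    return visual + vis_from_aud, audio + aud_from_vis, w_va, w_av


# Toy usage with random features: 6 visual frames, 4 audio frames, width 8.
rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(6, d))
audio = rng.normal(size=(4, d))
params = {
    "visual_queries_audio": tuple(rng.normal(size=(d, d)) for _ in range(3)),
    "audio_queries_visual": tuple(rng.normal(size=(d, d)) for _ in range(3)),
}
visual_out, audio_out, w_va, w_av = bidirectional_cross_attention(visual, audio, params)
print(visual_out.shape, audio_out.shape)  # (6, 8) (4, 8)
print(w_va.shape, w_av.shape)             # (6, 4) (4, 6)
```

The point of running attention in both directions is that each stream keeps its own sequence length while being enriched with information from the other, so a later pooling or classification stage can use both updated streams.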

Findings

01. Achieved a Macro F1 score of 0.32 on the 10th ABAW EXPR benchmark.

02. Demonstrated the effectiveness of multimodal fusion and temporal modeling.

03. Outperformed baseline methods significantly.
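For readers unfamiliar with the metric reported above, Macro F1 is the unweighted mean of the per-class F1 scores, so rare expression classes count as much as frequent ones. The sketch below computes it in plain Python on a made-up three-class example. The function name `macro_f1`, the toy labels, and the convention of scoring a class with no true or predicted samples as 0 are assumptions of this sketch, not details taken from the paper or the challenge's evaluation code.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores.

    A class with no true and no predicted samples is scored 0.0
    (an assumed convention for this sketch).
    """
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy labels for three hypothetical classes (not real benchmark data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, num_classes=3), 4))  # 0.6556
```

In the toy example the per-class F1 scores are 0.5, 0.8, and 2/3, and their mean is 59/90, about 0.6556.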

Abstract

Expression recognition in in-the-wild video data remains challenging due to substantial variations in facial appearance, background conditions, audio noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient for capturing these complex emotional cues. To address this limitation, we propose a multimodal emotion recognition framework for the Expression (EXPR) task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our framework builds on large-scale pre-trained models for visual and audio representation learning and integrates them in a unified multimodal architecture. To better capture temporal patterns in facial expression sequences, we incorporate temporal visual modeling over video windows. We further introduce a bi-directional cross-attention fusion module that enables visual…

Taxonomy

Topics: Emotion and Mood Recognition · Face Recognition and Analysis · Human Pose and Action Recognition