Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

Jun Yu; Naixiang Zheng; Guoyuan Wang; Yunxiang Zhang; Lingsi Zhu; Jiaen Liang; Wei Huang; Shengping Liu

arXiv:2603.08034 · cs.CV · March 10, 2026

Open Access

TL;DR

This paper introduces a multimodal emotion recognition framework that handles missing modalities and class imbalance with a dual-branch Transformer using safe cross-attention, modality dropout, and focal loss, reaching 60.79% accuracy on the Aff-Wild2 validation set and an F1-score of 0.5029.

Contribution

The paper presents a multimodal Transformer architecture that combines safe cross-attention with a modality dropout strategy to improve emotion recognition in the wild.
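The listing does not define "safe cross-attention" or the dropout schedule, so the following is only a minimal sketch under stated assumptions: "safe" is read here as a masked softmax that returns zero weights (rather than NaN) when every visual key is missing, followed by a residual connection so the audio query passes through unchanged. The function names, the single-query formulation, and the `p_drop` parameter are all hypothetical and are not taken from the paper.

```python
import math
import random


def safe_masked_softmax(scores, valid):
    """Softmax over positions where valid[i] is True; all zeros if none are valid."""
    kept = [s for s, v in zip(scores, valid) if v]
    if not kept:
        # A plain softmax over a fully masked row would divide 0 by 0.
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if v else 0.0 for s, v in zip(scores, valid)]
    total = sum(exps)
    return [e / total for e in exps]


def safe_cross_attention(query, keys, values, valid):
    """Single-query scaled dot-product attention with a residual fallback.

    Assumes query and value vectors share the same dimension.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = safe_masked_softmax(scores, valid)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    # With no valid keys the context is zero, so the query is returned as-is.
    return [q + c for q, c in zip(query, context)]


def modality_dropout(visual, p_drop, rng, training=True):
    """Return (features, valid_mask); hide the whole visual stream with probability p_drop."""
    valid = [True] * len(visual)
    if training and rng.random() < p_drop:
        valid = [False] * len(visual)
    return visual, valid
```

Under these assumptions, a training step would call `modality_dropout` on the visual tokens and pass the resulting mask to `safe_cross_attention`, so the audio branch regularly sees the "no visual input" case it must handle at test time.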

Findings

01. Achieved 60.79% accuracy on the Aff-Wild2 validation set.

02. Improved the F1-score to 0.5029 with the proposed methods.

03. Effectively handles missing modalities and class imbalance.

Abstract

Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex…
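The two post-hoc pieces named in the abstract, focal loss and sliding-window soft voting, can be sketched as follows. This is an illustration rather than the authors' implementation: the focal loss uses the standard form -alpha * (1 - p_t)^gamma * log(p_t), and the values `gamma=2.0`, `alpha=1.0`, the centered window, and `window=5` are assumptions, since the abstract does not state the settings used.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def sliding_window_soft_vote(frame_probs, window=5):
    """Average class probabilities over a centered window, then take the argmax per frame."""
    half = window // 2
    n = len(frame_probs)
    labels = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)  # window is clipped at the edges
        num_classes = len(frame_probs[t])
        mean = [
            sum(frame_probs[i][c] for i in range(lo, hi)) / (hi - lo)
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=mean.__getitem__))
    return labels


# A single low-confidence frame (index 2) flips under per-frame argmax
# but is absorbed once probabilities are averaged with its neighbours.
frames = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.8, 0.2]]
smoothed = sliding_window_soft_vote(frames, window=3)
```

With `gamma=0.0` the focal loss reduces to ordinary cross-entropy; larger values shrink the loss on well-classified frames, which is how it shifts weight toward rare classes in a long-tailed label distribution.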

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Music and Audio Processing