Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

Elena Ryumina (1), Maxim Markitantov (1), Alexandr Axyonov (1), Dmitry Ryumin (1), Mikhail Dolgushin (1), Denis Dresvyanskiy (2), Alexey Karpov (1, 2) ((1) St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia; (2) ITMO University, St. Petersburg, Russia)

arXiv:2603.13056 · cs.CV · March 16, 2026


Open Access

TL;DR

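This paper introduces a multimodal approach to in-the-wild valence and arousal estimation that combines face, behavior, and audio modalities, each modeled with a dedicated neural encoder (GRADA with a Transformer, Qwen3-VL-4B-Instruct with Mamba, and WavLM-Large, respectively), and explores fusion strategies to improve emotion recognition accuracy.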

Contribution

The paper presents a novel multimodal fusion framework with adaptive weighting and reliability-aware strategies for emotion estimation in unconstrained environments.

Findings

01. Achieved a concordance correlation coefficient (CCC) of 0.658 on the Aff-Wild2 dataset.

02. Demonstrated the effectiveness of multimodal fusion strategies.

03. Improved emotion recognition performance over baseline methods.
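
For readers unfamiliar with the metric, the following is a minimal sketch of how a concordance correlation coefficient is commonly computed for a single dimension (valence or arousal). It uses the standard textbook definition with population variance and covariance; the function name `concordance_cc` is illustrative and not taken from the paper, and whether the 0.658 figure is a per-dimension value or an average over valence and arousal is not stated here.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Illustrative helper (name is an assumption, not from the paper).
    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_p, mean_g = fmean(pred), fmean(gold)

    # Population (biased) variance and covariance, as in the usual CCC definition.
    var_p = fmean((p - mean_p) ** 2 for p in pred)
    var_g = fmean((g - mean_g) ** 2 for g in gold)
    cov = fmean((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2.0 * cov / denom


if __name__ == "__main__":
    # Toy values only; these are not results from the paper.
    gold = [0.0, 1.0, 2.0]
    shifted = [1.0, 2.0, 3.0]  # perfectly correlated but offset by 1
    print(concordance_cc(gold, gold))
    print(concordance_cc(shifted, gold))
```

Unlike Pearson correlation, the squared mean difference in the denominator penalizes predictions that track the annotation trend but are offset from it, which is why the shifted toy sequence scores below 1.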

Abstract

Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies:…
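
The abstract mentions attention-statistics pooling over WavLM-Large features without giving details. As a rough illustration only, the sketch below shows one common formulation of attentive statistics pooling: per-frame attention scores are softmax-normalized, then an attention-weighted mean and standard deviation are computed per feature dimension and concatenated. The function name `attentive_stats_pool`, the `scores` input (which in practice would come from a small learned network), and the `eps` floor are all assumptions; the paper's exact variant may differ.

```python
import math


def attentive_stats_pool(frames, scores, eps=1e-12):
    """Pool T frame vectors of size D into one vector of size 2*D.

    frames: list of T lists, each with D floats (e.g. frame-level audio features).
    scores: list of T unnormalized attention scores (hypothetical input;
            normally produced by a learned attention layer).
    Returns the attention-weighted mean followed by the weighted std.
    """
    if not frames or len(frames) != len(scores):
        raise ValueError("frames and scores must be non-empty and of equal length")

    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    # Attention-weighted mean for each feature dimension.
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Attention-weighted variance around that mean, floored at eps before sqrt.
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, eps)) for v in var]
    return mean + std


if __name__ == "__main__":
    # Two toy frames with equal scores, so the weights are uniform.
    pooled = attentive_stats_pool([[0.0, 2.0], [2.0, 2.0]], [0.0, 0.0])
    print(pooled)
```

With equal scores the result reduces to the plain mean and standard deviation over frames; unequal scores let the pooled vector emphasize the frames the attention layer considers most informative.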


Taxonomy

Topics: Emotion and Mood Recognition · Face recognition and analysis · Speech and Audio Processing