Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach
Elena Ryumina (1), Maxim Markitantov (1), Alexandr Axyonov (1), Dmitry Ryumin (1), Mikhail Dolgushin (1), Denis Dresvyanskiy (2), Alexey Karpov (1, 2) ((1) St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, (2) ITMO University)

TL;DR
This paper introduces a multimodal approach combining face, behavior, and audio data for in-the-wild valence and arousal estimation, utilizing advanced neural models and fusion strategies to improve emotion recognition accuracy.
Contribution
The paper presents a novel multimodal fusion framework with adaptive weighting and reliability-aware strategies for emotion estimation in unconstrained environments.
Findings
Achieved a concordance correlation coefficient (CCC) of 0.658 on the Aff-Wild2 dataset.
Demonstrated the effectiveness of multimodal fusion strategies.
Improved emotion recognition performance over baseline methods.
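For readers unfamiliar with the metric quoted in the findings, the following is a minimal sketch of how a concordance correlation coefficient is usually computed for a single dimension (valence or arousal). It uses the standard textbook definition with population variances. The function name `concordance_cc` and the handling of the degenerate constant-input case are assumptions made for illustration, and the summary above does not say how the reported 0.658 was aggregated across valence and arousal.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    Population (divide-by-N) statistics are used throughout.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")

    mean_p = fmean(pred)
    mean_g = fmean(gold)

    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal; returning 0.0 here is an
        # assumed convention, not something specified by the paper.
        return 0.0
    return 2.0 * cov / denom
```

Unlike Pearson correlation, the squared mean difference in the denominator penalises predictions that follow the right trend but are shifted: predictions `[1, 2, 3]` against labels `[2, 3, 4]` are perfectly correlated yet score only 4/7.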
Abstract
Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies:…
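The attention-statistics pooling mentioned for the audio modality can be illustrated with a small sketch. This is the generic form of the technique (a softmax-weighted mean and standard deviation over frames, concatenated into one vector), not the authors' implementation. The function name `attentive_stats_pool` is hypothetical, and the per-frame attention scores are passed in directly because the abstract does not describe the scoring network that would normally produce them.

```python
import math


def attentive_stats_pool(frames, scores):
    """Pool a sequence of frame vectors into one [weighted mean, weighted std] vector.

    frames: list of T feature vectors, each a list of D floats.
    scores: list of T unnormalised attention scores (assumed to come from
            some small scoring network, which is not modelled here).
    """
    if len(frames) != len(scores) or not frames:
        raise ValueError("frames and scores must be non-empty and of equal length")

    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    steps = range(len(frames))

    # Attention-weighted first and second moments per feature dimension.
    mean = [sum(weights[t] * frames[t][d] for t in steps) for d in range(dim)]
    second = [sum(weights[t] * frames[t][d] ** 2 for t in steps) for d in range(dim)]

    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(second[d] - mean[d] ** 2, 0.0)) for d in range(dim)]

    return mean + std
```

With equal scores the result reduces to the plain mean and standard deviation over frames; a strongly peaked score distribution makes the output track the highest-scoring frame, which is one way low-information frames can be down-weighted.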
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Speech and Audio Processing
