Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Elena Ryumina (1), Alexandr Axyonov (1), Dmitry Sysoev (2), Timur Abdulkadirov (2), Kirill Almetov (2), Yulia Morozova (2), Dmitry Ryumin (1, 2) ((1) St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia; (2) HSE University, St. Petersburg, Russia)

arXiv:2603.12848 · cs.CV · March 16, 2026

Open Access

TL;DR

This paper presents a multimodal approach that combines scene, face, audio, and text data for video-level ambivalence/hesitancy recognition in the 10th ABAW Competition, with multimodal fusion outperforming all unimodal baselines.

Contribution

The paper introduces a multimodal fusion framework that integrates four modalities (scene, face, audio, and text) and includes prototype-augmented fusion variants for recognizing ambivalence/hesitancy.

Findings

01. Multimodal fusion outperforms all unimodal baselines.

02. Best fusion model achieves 83.25% MF1 score.

03. Ensemble of models reaches 71.43% test accuracy.
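
The MF1 figure quoted in the findings is not expanded on this page. Assuming it denotes the macro-averaged F1 score (an assumption, not stated in the source), a minimal sketch of how such a score could be computed from video-level labels is shown here. The function name `macro_f1` and the example labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Assumption: "MF1" on this page refers to this metric.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical binary labels: 1 = ambivalence/hesitancy present, 0 = absent
example_true = [1, 1, 0, 0]
example_pred = [1, 0, 0, 0]
example_score = macro_f1(example_true, example_pred)
```

Because each class contributes equally regardless of how often it occurs, this metric penalizes a model that ignores the minority class, which is one reason it is commonly preferred over accuracy on imbalanced data.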

Abstract

Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH…
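
The abstract states that frame-level facial embeddings are aggregated by statistical pooling but does not say which statistics are used. As a rough sketch only, assuming per-dimension mean and standard deviation (a common choice, not confirmed by the source), the aggregation could look like this. The function name `stat_pool` and the toy embeddings are hypothetical.

```python
from statistics import fmean, pstdev


def stat_pool(frames):
    """Pool per-frame embeddings into one video-level vector.

    `frames` is a list of equal-length lists of floats, one per frame.
    Returns the per-dimension means followed by the per-dimension
    (population) standard deviations, so the output is twice as long
    as a single frame embedding.
    Assumption: mean + std are the pooled statistics; the paper may differ.
    """
    if not frames:
        raise ValueError("need at least one frame embedding")
    if len({len(f) for f in frames}) != 1:
        raise ValueError("all frame embeddings must have the same length")
    dims = list(zip(*frames))  # transpose: one tuple of values per dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]
    return means + stds


# Two hypothetical 2-dimensional frame embeddings from one video
video_vector = stat_pool([[1.0, 2.0], [3.0, 6.0]])
```

Pooling of this kind yields a fixed-size vector for a video of any length, which is what lets a variable number of frames feed a downstream fusion model with a fixed input size.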

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Face recognition and analysis