Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion

Aislan Gabriel O. Souza; Agostinho Freire; Leandro Honorato Silva; Igor Lucas B. da Silva; João Vinícius R. de Andrade; Gabriel C. de Albuquerque; Lucas Matheus da S. Oliveira; Mário Stela Guerra; Luciana Machado

arXiv:2603.16939 · cs.CV · March 19, 2026

TL;DR

This paper introduces a divergence-based multimodal fusion method for recognizing ambivalence and hesitancy in videos, explicitly measuring cross-modal conflicts among visual, audio, and textual data to improve detection accuracy.

Contribution

The paper presents a novel divergence-based fusion approach that explicitly models cross-modal conflict for better A/H recognition in videos.

Findings

01. Achieved a Macro F1 score of 0.6808 on the validation set, surpassing the challenge baseline of 0.2827.

02. Identified temporal variability of Action Units as the dominant visual discriminator of A/H.

03. Demonstrated the effectiveness of explicit conflict measurement in multimodal fusion.
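
For readers unfamiliar with the metric reported above, Macro F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how many videos it contains. The following is a minimal sketch in plain Python, not the challenge's evaluation script; the function name `macro_f1` and the binary label encoding (0 = no A/H, 1 = A/H) are assumptions made for illustration.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Return the unweighted mean of per-class F1 scores.

    The label set (0 = no A/H, 1 = A/H) is an illustrative assumption.
    """
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class is penalized:
# class 0 gets F1 = 6/7, class 1 gets F1 = 0, so Macro F1 = 3/7.
majority_only = macro_f1([1, 0, 0, 0], [0, 0, 0, 0])
```

Because the minority class contributes half of the score in the binary case, a model cannot reach a high Macro F1 by ignoring it.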

Abstract

We address the Ambivalence/Hesitancy (A/H) Video Recognition Challenge at the 10th ABAW Competition (CVPR 2026). We propose a divergence-based multimodal fusion that explicitly measures cross-modal conflict between visual, audio, and textual channels. Visual features are encoded as Action Units (AUs) extracted via Py-Feat, audio via Wav2Vec 2.0, and text via BERT. Each modality is processed by a BiLSTM with attention pooling and projected into a shared embedding space. The fusion module computes pairwise absolute differences between modality embeddings, directly capturing the incongruence that characterizes A/H. On the BAH dataset, our approach achieves a Macro F1 of 0.6808 on the validation test set, outperforming the challenge baseline of 0.2827. Statistical analysis across 1,132 videos confirms that temporal variability of AUs is the dominant visual discriminator of A/H.
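
The fusion step described in the abstract can be illustrated with a small sketch. It assumes each modality has already been reduced to a fixed-length vector in the shared embedding space (the BiLSTM, attention pooling, and projection are omitted), and it simply concatenates the element-wise absolute differences of every modality pair. The function name `divergence_fusion`, the dictionary input, and the pair ordering are hypothetical; the abstract does not state whether the original embeddings are also passed to the classifier, so this sketch returns only the difference terms.

```python
from itertools import combinations


def divergence_fusion(embeddings):
    """Concatenate |e_i - e_j| for every pair of modality embeddings.

    `embeddings` maps a modality name to a list of floats; all vectors must
    share one dimension. Names and ordering are illustrative assumptions.
    """
    names = sorted(embeddings)  # fixed order so the output layout is stable
    dim = len(embeddings[names[0]])
    if any(len(embeddings[n]) != dim for n in names):
        raise ValueError("all modality embeddings must have the same length")

    fused = []
    for a, b in combinations(names, 2):
        # Element-wise absolute difference: large values mean the two
        # modalities disagree along that embedding dimension.
        fused.extend(abs(x - y) for x, y in zip(embeddings[a], embeddings[b]))
    return fused


# Toy 2-dimensional embeddings for the three modalities.
example = divergence_fusion({
    "visual": [1.0, 0.0],
    "audio": [0.0, 0.0],
    "text": [1.0, 2.0],
})
# Pairs are taken in sorted order: (audio, text), (audio, visual), (text, visual).
```

With three modalities of dimension d, the fused vector has length 3d. When all modalities agree, every entry is zero, which is the sense in which the representation exposes cross-modal incongruence directly.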

Taxonomy

Topics: Multimodal Machine Learning Applications · Emotion and Mood Recognition · Speech and Audio Processing