Multimodal Video Emotion Recognition with Reliable Reasoning Priors
Zhepeng Wang, Yingjian Zhu, Guanghao Dong, Hongzhu Yi, Feng Chen, Xinming Wang, Jun Xie

TL;DR
This paper presents a multimodal emotion recognition framework that injects reasoning priors generated by multimodal large language models (MLLMs) into the fusion stage and adds a balanced contrastive loss to counter class imbalance. The authors report substantial performance gains on the MER2024 benchmark.
Contribution
The paper introduces a method for incorporating MLLM-derived reasoning priors into multimodal emotion recognition and proposes a balanced contrastive loss, Balanced Dual-Contrastive Learning, to address class imbalance.
Findings
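The prior-enhanced framework yields substantial performance gains on the MER2024 benchmark
Injecting reasoning priors at the fusion stage enriches cross-modal interactions
Balanced Dual-Contrastive Learning mitigates class imbalance by jointly balancing inter-class and intra-class distributions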
Abstract
This study investigates the integration of trustworthy prior reasoning knowledge from multimodal large language models (MLLMs) into multimodal emotion recognition. We employ Gemini to generate fine-grained, modality-separable reasoning traces, which are injected as priors during the fusion stage to enrich cross-modal interactions. To mitigate the pronounced class imbalance in multimodal emotion recognition, we introduce Balanced Dual-Contrastive Learning, a loss formulation that jointly balances inter-class and intra-class distributions. Applied to the MER2024 benchmark, our prior-enhanced framework yields substantial performance gains, demonstrating that the reliability of MLLM-derived reasoning can be combined with the domain adaptability of lightweight fusion networks for robust, scalable emotion recognition.
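The abstract does not spell out the Balanced Dual-Contrastive Learning loss, so the following is only a minimal sketch of the general idea of class-balancing a supervised contrastive loss, not the authors' formulation. The function name `balanced_supcon_loss`, the temperature value, and the two balancing choices (each class contributes its mean similarity term to the denominator, and anchor losses are averaged per class before averaging across classes) are all assumptions made for illustration.

```python
import math
from collections import Counter


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    # L2-normalize so that dot products are cosine similarities.
    norm = math.sqrt(_dot(v, v)) or 1.0
    return [a / norm for a in v]


def balanced_supcon_loss(embeddings, labels, temperature=0.1):
    """Hypothetical class-balanced supervised contrastive loss.

    Illustrative only: an assumed stand-in for the paper's loss,
    which this summary does not define.
    """
    z = [_normalize(v) for v in embeddings]
    classes = Counter(labels)
    n = len(z)
    per_class_losses = {}

    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor needs at least one same-class sample

        # Balanced denominator: each class contributes the MEAN of its
        # exp-similarities, so frequent classes do not dominate the sum.
        denom = 0.0
        for c in classes:
            members = [j for j in range(n) if j != i and labels[j] == c]
            if members:
                total = sum(math.exp(_dot(z[i], z[j]) / temperature)
                            for j in members)
                denom += total / len(members)

        log_denom = math.log(denom)
        loss_i = -sum(_dot(z[i], z[j]) / temperature - log_denom
                      for j in positives) / len(positives)
        per_class_losses.setdefault(labels[i], []).append(loss_i)

    if not per_class_losses:
        return 0.0
    # Average within each class first, then across classes, so that rare
    # emotion classes weigh as much as frequent ones.
    class_means = [sum(v) / len(v) for v in per_class_losses.values()]
    return sum(class_means) / len(class_means)


# Toy imbalanced batch: four samples of one emotion, two of another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.95, 0.0],
                  [0.0, 1.0], [0.1, 0.9]]
aligned_loss = balanced_supcon_loss(toy_embeddings, [0, 0, 0, 0, 1, 1])
shuffled_loss = balanced_supcon_loss(toy_embeddings, [0, 1, 0, 1, 0, 1])
```

In this toy batch the loss is lower when the labels agree with the geometric clusters (`aligned_loss`) than when they are interleaved (`shuffled_loss`), which is the behavior a contrastive objective is expected to show. How the paper balances the intra-class term may differ from the per-class averaging assumed here.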
