EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
Yueru Sun, Yimeng Zhang, Haoyu Gu, Nuo Chen, Dong She, Xianrong Yao, Yang Gao, Zhanpeng Jin

TL;DR
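This paper introduces EmoMM, a benchmark for multimodal emotion recognition with MLLMs that covers modality-aligned, conflict, and missing-modality cases. Evaluation on it reveals a Video Contribution Collapse (VCC) phenomenon, and the paper proposes CHASE, an inference-time attention steering method, to reduce decision bias under modality conflict.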
Contribution
The paper presents the EmoMM benchmark, uncovers the VCC phenomenon, and proposes CHASE, a lightweight attention steering method that mitigates modality bias without retraining.
Findings
MLLMs exhibit VCC: they marginalize video evidence because of high token redundancy and modality preferences.
CHASE detects modality conflicts and steers attention at inference time, mitigating decision bias without retraining the backbone.
Experimental results show CHASE enhances reliability in complex affective scenarios.
Abstract
Multimodal Emotion Recognition (MER) is critical for interpreting real-world interactions. While Multimodal Large Language Models (MLLMs) have shown promise in MER, their internal decision-making mechanisms under modality conflict and missingness remain largely underexplored. In this paper, to systematically investigate these behaviors, we introduce EmoMM, a comprehensive benchmark featuring modality-aligned, conflict, and missing subsets. Through extensive evaluation, we uncover a Video Contribution Collapse (VCC) phenomenon, where MLLMs marginalize video evidence due to high token redundancy and modality preferences. To address this, we propose Conflict-aware Head-level Attention Steering (CHASE), a lightweight mechanism that detects modality conflicts and performs inference-time attention steering, effectively mitigating decision bias without retraining the backbone. Experimental…
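The abstract describes CHASE only at a high level: detect a modality conflict, then steer attention at the head level during inference. As a rough illustration of what that idea could look like, and not the authors' implementation, the NumPy sketch below flags a conflict when per-modality predictions disagree and then upweights video tokens in a chosen set of heads before renormalizing. The function names, the disagreement rule, the choice of heads, and the scaling factor `alpha` are all assumptions made for this example.

```python
import numpy as np


def detect_conflict(video_probs, audio_probs, text_probs):
    """Hypothetical conflict rule: the modalities disagree on the top label."""
    top_labels = {int(np.argmax(p)) for p in (video_probs, audio_probs, text_probs)}
    return len(top_labels) > 1


def steer_attention(attn, video_mask, heads, alpha=2.0):
    """Upweight video tokens in the selected heads, then renormalize.

    attn:       (num_heads, num_tokens) attention weights, each row sums to 1
    video_mask: (num_tokens,) boolean, True where the key token is a video token
    heads:      indices of heads to steer (assumed to be chosen elsewhere)
    alpha:      multiplicative boost for video tokens (assumed hyperparameter)
    """
    steered = attn.copy()
    scale = np.where(video_mask, alpha, 1.0)
    rows = attn[heads] * scale
    # Renormalize so each steered row is again a probability distribution.
    steered[heads] = rows / rows.sum(axis=-1, keepdims=True)
    return steered


# Toy example: 2 heads, 4 key tokens, the first 2 tokens are video tokens.
attn = np.full((2, 4), 0.25)
video_mask = np.array([True, True, False, False])

# Steer only when the (toy) per-modality predictions disagree.
conflict = detect_conflict([0.9, 0.1], [0.2, 0.8], [0.7, 0.3])
steered = steer_attention(attn, video_mask, heads=[0], alpha=3.0) if conflict else attn
```

In this toy case only head 0 is modified, and its video tokens end up with a larger share of the attention mass, while head 1 is left untouched. Because the weights are only rescaled at inference time, no backbone parameters change, which is the property the abstract emphasizes.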
