EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness

Yueru Sun; Yimeng Zhang; Haoyu Gu; Nuo Chen; Dong She; Xianrong Yao; Yang Gao; Zhanpeng Jin

arXiv:2605.01024·cs.CV·May 5, 2026

TL;DR

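This paper introduces EmoMM, a benchmark for multimodal emotion recognition (MER) with multimodal large language models (MLLMs) that covers modality-aligned, conflict, and missing subsets. It reveals a Video Contribution Collapse (VCC) phenomenon and proposes CHASE, an inference-time attention steering method, to improve decision-making under modality conflicts.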

Contribution

The paper presents the EmoMM benchmark, uncovers the VCC phenomenon, and proposes CHASE, a lightweight attention steering method that mitigates modality bias without retraining the backbone.

Findings

01 Under VCC, MLLMs marginalize video evidence because of high token redundancy and modality preferences.

02 CHASE detects modality conflicts and steers attention at inference time, without retraining the backbone.

03 Experimental results show that CHASE enhances reliability in complex affective scenarios.

Abstract

Multimodal Emotion Recognition (MER) is critical for interpreting real-world interactions. While Multimodal Large Language Models (MLLMs) have shown promise in MER, their internal decision-making mechanisms under modality conflict and missingness remain underexplored. In this paper, to systematically investigate these behaviors, we introduce EmoMM, a comprehensive benchmark featuring modality-aligned, conflict, and missing subsets. Through extensive evaluation, we uncover a Video Contribution Collapse (VCC) phenomenon, where MLLMs marginalize video evidence due to high token redundancy and modality preferences. To address this, we propose Conflict-aware Head-level Attention Steering (CHASE), a lightweight mechanism that detects modality conflicts and performs inference-time attention steering, effectively mitigating decision bias without retraining the backbone. Experimental…
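To make the idea of head-level, inference-time attention steering concrete, here is a minimal NumPy sketch. It is not the CHASE implementation: the summary above does not specify how conflicts are detected, which heads are steered, or how strongly. The function name `steer_attention`, the `conflict` flag, the `head_ids` selection, and the boost factor `alpha` are all hypothetical choices made for illustration. The sketch simply upweights attention to video tokens in the chosen heads and renormalizes each row, leaving all model weights untouched.

```python
import numpy as np


def steer_attention(attn, video_mask, head_ids, conflict, alpha=2.0):
    """Illustrative head-level steering (an assumption, not the paper's method).

    attn:       array of shape (num_heads, num_queries, num_keys); each row sums to 1
    video_mask: boolean array of shape (num_keys,), True for video tokens
    head_ids:   list of head indices to steer (hypothetical selection)
    conflict:   bool, output of some conflict detector (not specified here)
    alpha:      multiplicative boost for video tokens (hypothetical value, > 0)
    """
    if not conflict:
        # No modality conflict detected: leave attention unchanged.
        return attn

    steered = attn.copy()
    # Per-key scale: alpha on video tokens, 1 elsewhere; broadcasts over heads and queries.
    scale = np.where(video_mask, alpha, 1.0)
    boosted = attn[head_ids] * scale
    # Renormalize so every steered row is a valid attention distribution again.
    steered[head_ids] = boosted / boosted.sum(axis=-1, keepdims=True)
    return steered


# Toy usage: 2 heads, 1 query, 4 keys; the first two keys are video tokens.
attn = np.full((2, 1, 4), 0.25)
video_mask = np.array([True, True, False, False])
steered = steer_attention(attn, video_mask, head_ids=[0], conflict=True, alpha=3.0)
print(steered[0, 0])  # head 0: more mass on video tokens
print(steered[1, 0])  # head 1: unchanged
```

In a real MLLM this kind of rescaling would be applied inside the attention computation of selected layers, which is one way steering can happen at inference time without retraining.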
