TL;DR
This paper introduces CA-MER, a benchmark for emotion conflict scenarios in multimodal emotion reasoning, and proposes MoSEAR, a framework that balances modality contributions to improve emotion recognition accuracy.
Contribution
The paper presents CA-MER, a benchmark for evaluating MLLMs under emotion conflicts, and proposes MoSEAR, a parameter-efficient framework that mitigates modality bias in multimodal emotion reasoning.
Findings
MoSEAR reduces modality bias during emotion conflicts.
MoSEAR achieves state-of-the-art results on multiple benchmarks.
Balanced modality integration improves emotion recognition accuracy.
Abstract
Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, in which the true emotion is reflected by the video only, the audio only, or both modalities, respectively. Evaluations on CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on the audio signal during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. MoSEAR consists of two modules: (1) MoSE, modality-specific experts with a…
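The three-subset design described in the abstract suggests a simple way to look for modality bias: compute accuracy separately on each subset and compare. The sketch below is illustrative only and is not taken from the paper. The record layout, the function names `accuracy_by_subset` and `audio_bias_gap`, the toy predictions, and the gap measure itself are all assumptions made for this example.

```python
from collections import defaultdict

# Subset names follow the abstract; everything else here is hypothetical.
SUBSETS = ("video-aligned", "audio-aligned", "consistent")


def accuracy_by_subset(records):
    """Compute accuracy per subset.

    `records` is an iterable of (subset, predicted_label, true_label)
    tuples. This layout is an assumption, not the benchmark's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, predicted, truth in records:
        total[subset] += 1
        correct[subset] += int(predicted == truth)
    # Only report subsets that actually have samples.
    return {s: correct[s] / total[s] for s in SUBSETS if total[s]}


def audio_bias_gap(acc):
    """Illustrative bias measure (an assumption, not the paper's metric).

    A positive value means the model does better when the audio carries
    the true emotion than when the video does.
    """
    return acc["audio-aligned"] - acc["video-aligned"]


# Toy predictions, invented purely to exercise the functions.
records = [
    ("video-aligned", "happy", "sad"),
    ("video-aligned", "sad", "sad"),
    ("audio-aligned", "angry", "angry"),
    ("audio-aligned", "happy", "happy"),
    ("consistent", "neutral", "neutral"),
]

acc = accuracy_by_subset(records)
gap = audio_bias_gap(acc)
print(acc)
print(gap)
```

Under this assumed measure, a model with balanced modality integration would show a gap near zero, while a large positive gap would be consistent with the audio over-reliance the abstract reports.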
