Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning
Zhixian Zhao, Wenjie Tian, Lei Xie

TL;DR
This paper introduces SABER-LLM, a multimodal emotion reasoning framework that leverages a large-scale dataset and structured evidence decomposition to improve robustness and accuracy in complex social scenarios.
Contribution
The paper presents SABER, a new large-scale emotion reasoning dataset with a novel six-dimensional schema, and proposes a structured evidence decomposition paradigm for robust multimodal emotion reasoning.
Findings
SABER-LLM outperforms open-source baselines in complex emotion reasoning tasks.
The structured evidence decomposition improves cross-modal fusion and reduces unimodal dominance.
The model achieves robustness comparable to closed-source models in decoding emotional dynamics.
Abstract
Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance, which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenarios). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset…
Taxonomy
Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
