Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning

Zhixian Zhao; Wenjie Tian; Lei Xie

arXiv:2601.18321 · cs.MM · February 5, 2026

TL;DR

This paper introduces SABER-LLM, a multimodal emotion reasoning framework that leverages a large-scale dataset and structured evidence decomposition to improve robustness and accuracy in complex social scenarios.

Contribution

The paper presents SABER, a new large-scale emotion reasoning dataset with a novel six-dimensional schema, and proposes a structured evidence decomposition paradigm for robust multimodal emotion reasoning.

Findings

01 SABER-LLM outperforms open-source baselines in complex emotion reasoning tasks.

02 The structured evidence decomposition improves cross-modal fusion and reduces unimodal dominance.

03 The model achieves robustness comparable to closed-source models in decoding emotional dynamics.

Abstract

Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals, such as facial micro-expressions and prosodic shifts, to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance, which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenes). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset…

Code & Models

Models

zhaoxiaoxian/SABER-LLM
model

Datasets

zhaoxiaoxian/SABER-Dataset
dataset · 78 downloads

Taxonomy

Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis