Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

Xiaofeng Yu; Jiaheng Dong; Jean Honorio; Abhirup Ghosh; Hong Jia; Ting Dang

arXiv:2603.08230 · cs.SD · March 10, 2026


Open Access

TL;DR

This paper reformulates ambiguous speech emotion recognition as a distributional reasoning problem for large audio-language models (LALMs). It pairs an ambiguity-aware training objective with structured chain-of-thought supervision so that predictions reflect the spread of human emotion judgments rather than a single label.

Contribution

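The authors present what they describe as the first systematic study of ambiguity-aware reasoning in large audio-language models for emotion prediction, combining an objective that aligns predictions with human perceptual distributions and structured chain-of-thought supervision over emotional cues.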

Findings

01. Consistent improvements on the IEMOCAP and CREMA-D datasets.

02. Predictions align more closely with human perceptual emotion distributions.

03. The gains hold across multiple training strategies, including SFT and DPO.

Abstract

Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and a structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and…
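The abstract does not spell out the form of the ambiguity-aware objective. As a minimal sketch of what aligning predictions with a human perceptual distribution can look like, the Python below turns annotator votes into a soft label and scores a prediction with KL divergence. The four-class label set, the function names `soft_label` and `kl_divergence`, the smoothing constants, and the choice of KL divergence are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter

# Hypothetical label set for illustration; the paper's label set may differ.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(votes, labels=EMOTIONS, eps=1e-6):
    """Turn a list of annotator votes into a smoothed probability distribution."""
    counts = Counter(votes)
    raw = [counts.get(label, 0) + eps for label in labels]
    total = sum(raw)
    return [value / total for value in raw]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two distributions over the same labels."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
    )


# Three annotators disagree: two hear "happy", one hears "neutral".
target = soft_label(["happy", "happy", "neutral"])

# A near one-hot prediction versus one that keeps mass on "neutral".
confident = [0.01, 0.97, 0.01, 0.01]
hedged = [0.02, 0.64, 0.32, 0.02]

loss_confident = kl_divergence(target, confident)
loss_hedged = kl_divergence(target, hedged)
print(f"confident: {loss_confident:.3f}  hedged: {loss_hedged:.3f}")
```

Under this assumed objective, the hedged prediction receives the lower loss because it keeps probability mass on the minority "neutral" judgment, whereas a single-label objective would reward the near one-hot prediction.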

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis