Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction
Xiaofeng Yu, Jiaheng Dong, Jean Honorio, Abhirup Ghosh, Hong Jia, Ting Dang

TL;DR
This paper introduces a novel framework for ambiguous emotion recognition in speech using large audio-language models, emphasizing distributional reasoning and structured thought guidance to better capture human emotional ambiguity.
Contribution
It presents the first systematic approach to ambiguity-aware reasoning in large audio-language models for emotion prediction, combining distributional objectives and chain-of-thought supervision.
Findings
Improved emotion recognition accuracy on IEMOCAP and CREMA-D datasets.
Effective alignment of predictions with human perceptual emotion distributions.
Consistent gains across multiple training strategies, including SFT and DPO.
Abstract
Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models (LALMs) show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and…
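To make the idea of aligning predictions with human perceptual distributions concrete, here is a minimal sketch. It is not the paper's objective, whose exact form is not given in this summary. It assumes the target is the normalized vote count across annotators and that mismatch is measured with KL divergence. The label set and the helper names `soft_label` and `kl_divergence` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical four-class label set, chosen only for illustration.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(votes, labels=EMOTIONS):
    """Turn per-annotator votes into a probability distribution over labels."""
    counts = Counter(votes)
    total = sum(counts[label] for label in labels)
    if total == 0:
        raise ValueError("no votes match the label set")
    return [counts[label] / total for label in labels]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); labels with zero target mass contribute 0."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


# Three annotators disagree: two hear "sad", one hears "neutral".
target = soft_label(["sad", "sad", "neutral"])

one_hot = [0.0, 0.0, 0.0, 1.0]  # single-label prediction: all mass on "sad"
spread = [0.0, 0.0, 0.3, 0.7]   # prediction that keeps some mass on "neutral"

loss_one_hot = kl_divergence(target, one_hot)
loss_spread = kl_divergence(target, spread)
```

Under these assumptions the single-label prediction is penalized heavily because it assigns no probability to an emotion that one annotator perceived, while the spread prediction incurs a much smaller loss. This is the sense in which a distributional target preserves ambiguity that a single label discards.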
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis
