Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

Wenyu Zhang; Yingxu He; Geyu Lin; Zhuohan Liu; Shuo Sun; Bin Wang; Xunlong Zou; Jeremy H. M. Wong; Qiongqiong Wang; Hardik B. Sailor; Nancy F. Chen; Ai Ti Aw

arXiv:2506.06820 · cs.CL · September 30, 2025

Open Access

TL;DR

This paper introduces a novel framework for speech emotion reasoning using multitask AudioLLMs, enabling better emotion understanding through evidence-grounded explanations and improved generalization across datasets.

Contribution

The paper proposes a unified framework for emotion reasoning in multitask AudioLLMs that combines reasoning-augmented data supervision, a dual-encoder architecture, and task-alternating training.
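To make the idea of task-alternating training concrete, here is a minimal sketch of a scheduler that switches tasks on every training step. It is illustrative only: this summary does not state how the paper orders or weights its tasks, so the round-robin order, the function name `alternate_tasks`, and the task names `asr` and `emotion_reasoning` are all assumptions.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Tuple


def alternate_tasks(
    task_batches: Dict[str, List[str]], num_steps: int
) -> Iterator[Tuple[str, str]]:
    """Yield (task_name, batch) pairs, switching task on every step.

    Hypothetical helper: a plain round-robin over tasks, assumed here
    for illustration and not taken from the paper.
    """
    # One endlessly repeating stream of batches per task.
    streams = {name: cycle(batches) for name, batches in task_batches.items()}
    # Visit tasks in insertion order, wrapping around.
    task_order = cycle(task_batches.keys())
    for _ in range(num_steps):
        task = next(task_order)
        yield task, next(streams[task])


# Placeholder batch identifiers stand in for real audio/text batches.
schedule = list(
    alternate_tasks(
        {
            "asr": ["asr_0", "asr_1"],
            "emotion_reasoning": ["emo_0", "emo_1", "emo_2"],
        },
        num_steps=6,
    )
)
```

In a real training loop, each yielded pair would be passed to a single shared model update, so that consecutive steps draw on different tasks rather than training one task to completion before the next.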

Findings

01 Improves emotion prediction accuracy on IEMOCAP and MELD datasets.

02 Enhances coherence and evidential grounding of generated responses.

03 Demonstrates strong generalization on out-of-domain datasets.

Abstract

Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our…

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis