Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Jingxiang Chen; Minseok Kim; Seong-Gyun Leem; Yin Huang; Rashi Rungta; Zhicheng Ouyang; Haibin Wu; Surya Teja Appini; Ankur Bansal; Yang Bai; Yue Liu; Florian Metze; Ahmed A Aly; Anuj Kumar; Ariya Rastrow; Zhaojiang Lin

arXiv:2603.15981 · cs.CL · March 18, 2026 · ACL

TL;DR

This paper introduces a multi-task reinforcement learning approach with chain-of-thought prompting to improve paralinguistic understanding and generation in speech LLMs, addressing data scarcity and annotation challenges.

Contribution

It presents PALLM, a novel paralinguistics-aware speech LLM that jointly optimizes sentiment classification and response generation, enhancing emotional intelligence in speech models.

Findings

01 Improved paralinguistics understanding by 8-12% over baselines.

02 Effective multi-task RL approach for emotional speech modeling.

03 Outperforms proprietary models like Gemini-2.5-Pro and GPT-4o-audio.

Abstract

Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling…
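
To make the idea of jointly optimizing sentiment classification and response generation concrete, the following is a minimal sketch of how a single scalar reward could combine the two tasks for one sampled output. It is not the paper's implementation: the `<sentiment>` tag format, the `Rollout` fields, the external `response_score`, and the equal weights are all assumptions introduced here for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical output format: the actual prompt and tags are not given in this summary.
_SENTIMENT_RE = re.compile(r"<sentiment>\s*(\w+)\s*</sentiment>", re.IGNORECASE)


@dataclass
class Rollout:
    text: str              # full model output, including any reasoning
    gold_sentiment: str    # reference sentiment label for the audio clip
    response_score: float  # assumed external quality score in [0, 1] for the reply


def sentiment_reward(rollout: Rollout) -> float:
    """Return 1.0 if the predicted label matches the reference, else 0.0."""
    match = _SENTIMENT_RE.search(rollout.text)
    if match is None:
        return 0.0  # a missing or malformed label earns nothing
    return float(match.group(1).lower() == rollout.gold_sentiment.lower())


def multitask_reward(rollout: Rollout, w_cls: float = 0.5, w_gen: float = 0.5) -> float:
    """Weighted sum of classification and generation rewards (weights are assumed)."""
    gen = min(max(rollout.response_score, 0.0), 1.0)  # clamp the score to [0, 1]
    return w_cls * sentiment_reward(rollout) + w_gen * gen
```

In a setup like this, an output that produces a fluent reply but mislabels the speaker's sentiment receives only partial reward, which is one way a training signal could discourage relying on lexical content alone.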

Taxonomy

Topics: Emotion and Mood Recognition · Topic Modeling · Speech Recognition and Synthesis