ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction
Shu-wen Yang, Ming Tu, Andy T. Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, Yonghui Wu

TL;DR
This paper introduces ParaS2S, a benchmark and reinforcement learning framework that helps speech-to-speech models handle paralinguistic cues such as emotion and tone, with experiments showing significant performance gains.
Contribution
The paper presents ParaS2SBench, an automatic judge trained with the PolyTone strategy, and an RL framework for aligning speech models with paralinguistic attributes, addressing a key challenge in speech interaction.
Findings
Existing models perform poorly on paralinguistic cues.
ParaS2SAlign improves response appropriateness by 10%.
The proposed evaluator correlates well with human preferences.
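One common way to check how well an automatic judge agrees with human preferences is a rank correlation between judge scores and human ratings. The sketch below computes Spearman's rank correlation in plain Python. The correlation measure is an assumption (this summary does not say which one the paper reports), and the function names and toy scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: judge scores and human ratings for five responses.
judge_scores = [4.5, 3.0, 2.0, 4.0, 1.0]
human_ratings = [5, 3, 2, 4, 1]
rho = spearman(judge_scores, human_ratings)
```

A value of `rho` near 1 means the judge ranks responses in nearly the same order as the human raters. Rank correlation is used here because it ignores differences in scale between the two sets of scores.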
Abstract
Speech-to-Speech (S2S) models have shown promising dialogue capabilities, but their ability to handle paralinguistic cues - such as emotion, tone, and speaker attributes - and to respond appropriately in both content and style remains under-explored. Progress is further hindered by the scarcity of high-quality and expressive demonstrations. To address this, we introduce a new reinforcement learning (RL) framework for paralinguistic-aware S2S, ParaS2S, which evaluates and optimizes both response content and speaking style directly at the waveform level. We first construct ParaS2SBench, a benchmark that evaluates the naturalness of input-output pairs in terms of content and speaking style using expressive and challenging queries. For the automatic judge, we propose a PolyTone training strategy and a multi-stage framework, preventing the style hallucination of end-to-end audio LLM judging.…
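One of the reviews notes that the RL stage applies GRPO. As a minimal sketch of the core idea in GRPO, the code below turns the rewards of several responses sampled for the same query into group-relative advantages by subtracting the group mean and dividing by the group standard deviation. The reward values, the function name, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    rewards: scores for several responses to the same query, for example
             from an automatic judge (hypothetical values in this sketch).
    eps:     small constant that avoids division by zero when all rewards
             in the group are equal (an assumption of this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact variant is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for four sampled spoken responses to one query.
rewards = [2.0, 4.0, 3.0, 5.0]
advantages = group_relative_advantages(rewards)
```

Responses scored above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. Because the baseline is the group mean, no separate value network is needed to estimate it.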
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed paralinguistic-aware S2S reinforcement learning framework is highly practical, effectively enhancing the model's ability to understand and generate paralinguistic information such as emotion and tone, which provides valuable tools and methods for the advancement of speech dialogue systems.
2. The experiments are thoroughly designed, covering various paralinguistic factors and realistic scenarios. The results comprehensively validate the significant improvements in content and sty
1. The methodological innovation of the paper is limited, as it merely applies GRPO in a straightforward manner.
2. The presentation lacks intuitiveness; it is difficult to fully convey the paralinguistic features of audio through text alone. It would be better if there were a demo page or web-based showcase.
3. Some references are missing, such as [1]. [1] Omnichat: Enhancing spoken dialogue systems with scalable synthetic data for diverse scenarios. arXiv preprint arXiv:2501.01384.
- The paper puts forward a novel benchmark and dataset, with a welcome commitment to their public release.
- The authors provide a valuable analysis of the respective impacts of RL and SFT on the modeling of non-verbal conversational aspects within the proposed framework.
* **On the Reward Model:** A point of consideration emerged regarding the reward model. I'm respectfully curious about the potential for it to be somewhat overfitted to the specific synthesis engines used for evaluation, namely the GPT-based TTS and CosyVoice. I would be interested to hear the authors' perspective on its generalization capabilities to other speech styles.
* **On Data Synthesis:** Additionally, as the audio corresponding to the evaluated scenarios appears to be entirely synthetic
The motivation to evaluate paralinguistic responses in speech language models is both natural and important. This reviewer appreciates the authors’ effort in advancing research on this topic. Moreover, the overall workload presented in this paper appears substantial.
(1) The core contributions of this paper are somewhat unclear. It mainly includes two parts: a new benchmark for evaluating speech response style and content, and an alignment technique for tuning speech language models. However, each contribution appears incomplete on its own. The benchmarking part omits many relevant speech and speech-to-speech models, while the proposed alignment method lacks sufficient novelty and empirical validation.
(2) The citation format should follow the ICLR template
Taxonomy
Topics: Emotion and Mood Recognition · Speech and dialogue systems · Speech Recognition and Synthesis
