Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale

Yicheng Zhong; Peiji Yang; Zhisheng Wang

arXiv:2511.21270 · cs.SD · November 27, 2025

Open Access

TL;DR

This paper introduces a multi-reward reinforcement learning framework to improve prosody, stability, and naturalness in single-codebook TTS large language models, addressing common issues like prosody instability and speaker drift.

Contribution

It proposes a novel multi-reward GRPO method that directly optimizes token generation for better prosody and stability, incorporating rule-based rewards and external LLM annotations.

Findings

01. Enhanced prosodic stability and naturalness in TTS models.

02. Consistent improvements across different data sizes and model scales.

03. Additional gains when attaching a flow-matching decoder.

Abstract

Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment…
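As a rough illustration of the idea in the abstract, the sketch below shows how several per-sample rewards might be combined and turned into group-relative advantages, the core step of GRPO. It is a minimal sketch under stated assumptions, not the authors' implementation: the weighted-sum combination, the weight values, the sample scores, and the form of `length_penalty` are all hypothetical. Only the within-group standardization follows the standard GRPO formulation. The entropy and prosody rewards named in the abstract are left out because the excerpt does not define them.

```python
from statistics import mean, pstdev


def length_penalty(n_tokens, expected_tokens):
    """Hypothetical length reward: 0 on target, more negative as the
    relative deviation from the expected token count grows."""
    return -abs(n_tokens - expected_tokens) / expected_tokens


def combine_rewards(components, weights):
    """Assumed combination rule: weighted sum of named reward components."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative weights and scores (made up for this sketch).
WEIGHTS = {"intelligibility": 1.0, "speaker_similarity": 1.0, "length": 0.5}

# One group: four candidate token sequences sampled for the same prompt.
group = [
    {"intelligibility": 0.90, "speaker_similarity": 0.80,
     "length": length_penalty(100, 100)},
    {"intelligibility": 0.70, "speaker_similarity": 0.80,
     "length": length_penalty(150, 100)},
    {"intelligibility": 0.95, "speaker_similarity": 0.60,
     "length": length_penalty(90, 100)},
    {"intelligibility": 0.50, "speaker_similarity": 0.70,
     "length": length_penalty(100, 100)},
]

rewards = [combine_rewards(c, WEIGHTS) for c in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

Because advantages are computed relative to the group mean, a sample is reinforced only when it beats the other candidates drawn for the same prompt, which is why no separate value network is needed.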

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research