Enhancing Conversational TTS with Cascaded Prompting and ICL-Based Online Reinforcement Learning
Zhicheng Ouyang, Seong-Gyun Leem, Bach Viet Do, Haibin Wu, Ariya Rastrow, Yuzong Liu, Florian Metze

TL;DR
This paper introduces a scalable, data-efficient TTS framework that uses cascaded prompting and ICL-based online reinforcement learning to improve expressivity and controllability in speech synthesis.
Contribution
It proposes a novel cascaded framework with human-curated audio prompts and an ICL-based online RL strategy for fine-grained style control without extensive retraining.
Findings
Human evaluations show significant improvements in speech naturalness and expressivity.
The ICL-based online RL strategy effectively optimizes prosody with aesthetic rewards.
The approach enables single-shot adaptation to diverse speaking styles and voices.
Abstract
Conversational AI has made significant progress, yet generating expressive and controllable text-to-speech (TTS) remains challenging. Specifically, controlling fine-grained voice styles and emotions is notoriously difficult and typically requires massive amounts of heavily annotated training data. To overcome this data bottleneck, we present a scalable, data-efficient cascaded framework that pairs textual style tokens with human-curated, high-quality audio prompts. This approach enables single-shot adaptation to fine-grained speaking styles and character voices. In the context of TTS, this audio prompting acts as In-Context Learning (ICL), guiding the model's prosody and timbre without requiring massive parameter updates or large-scale retraining. To further enhance generation quality and mitigate hallucinations, we introduce a novel ICL-based online reinforcement learning (RL)…
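One way to picture the pairing of textual style tokens with curated audio prompts described in the abstract is as a lookup step that sits between the text-generating stage and the TTS model. The sketch below is illustrative only: the token names, file paths, `PROMPT_LIBRARY`, `TTSRequest`, and `build_request` are all hypothetical and do not come from the paper, which does not specify its token format or prompt selection logic here.

```python
from dataclasses import dataclass

# Hypothetical registry: style token -> human-curated reference clip.
# Token names and file paths are illustrative assumptions, not from the paper.
PROMPT_LIBRARY = {
    "<whisper>": "prompts/whisper_ref.wav",
    "<excited>": "prompts/excited_ref.wav",
}
DEFAULT_PROMPT = "prompts/neutral_ref.wav"


@dataclass
class TTSRequest:
    text: str          # text to synthesize, with the style token removed
    audio_prompt: str  # path of the reference clip used as in-context prompt


def build_request(tagged_text: str) -> TTSRequest:
    """Split a leading style token off the text and pick its audio prompt."""
    for token, prompt_path in PROMPT_LIBRARY.items():
        if tagged_text.startswith(token):
            return TTSRequest(tagged_text[len(token):].strip(), prompt_path)
    # No known token: fall back to a neutral reference clip.
    return TTSRequest(tagged_text.strip(), DEFAULT_PROMPT)
```

Under these assumptions, supporting a new style means adding one curated clip and one registry entry rather than retraining, which is the sense in which audio prompting acts as in-context learning.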
