RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

Cong Wang; Changfeng Gao; Yang Xiang; Zhihao Du; Keyu An; Han Zhao; Qian Chen; Xiangang Li; Yingming Gao; Ya Li

arXiv:2512.04552 · cs.SD · February 17, 2026

TL;DR

This paper introduces RRPO, a new framework for emotion-controlled text-to-speech that enhances robustness against reward hacking, leading to more natural and expressive speech synthesis aligned with human perception.

Contribution

RRPO employs a hybrid regularization scheme to develop a robust reward model that mitigates reward hacking and improves emotional expressiveness in TTS.

Findings

01 The robust RM generalizes well across languages.

02 RRPO significantly improves emotional expressiveness.

03 RRPO mitigates reward hacking effectively.

Abstract

Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates…

Taxonomy

Topics: Emotion and Mood Recognition · Digital Mental Health Interventions · Speech Recognition and Synthesis