Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang

TL;DR
This paper introduces RIO (reverse inference optimization), a method that makes zero-shot text-to-speech systems more robust by applying reinforcement learning from human feedback to speech samples the system generates itself, without needing an explicit reward model.
Contribution
The paper presents RIO, a framework that improves zero-shot TTS robustness by using reverse inference to select RLHF exemplars, which removes the need for a reward model and reduces the incidence of bad outputs.
Findings
Significantly improves subjective and objective TTS metrics.
Reduces bad output incidence to nearly zero percent.
Enhances stability by aligning training and inference conditions.
Abstract
In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed reverse inference, based on the Bayesian principle, which suggests that high-quality generated speech should itself be usable as a prompt for subsequent generation with the same TTS model. By using reverse inference as the standard for selecting RLHF exemplars from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization towards enhancing TTS robustness. The RIO framework, comprising sampling, automatic annotating, and learning, obviates the…
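The sampling and automatic-annotation stages described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `synthesize` stands in for the TTS model, `score` for some automatic quality measure where higher is better, `reverse_text` for the text regenerated during reverse inference, and picking the best and worst candidate as a preference pair is one plausible way to select exemplars.

```python
from typing import Any, Callable, Tuple

def select_exemplars(
    text: str,
    prompt: Any,
    reverse_text: str,
    synthesize: Callable[[str, Any], Any],  # hypothetical: (text, speech prompt) -> speech
    score: Callable[[Any, str], float],     # hypothetical: higher means better speech
    n_samples: int = 4,
) -> Tuple[Any, Any]:
    """Return (preferred, rejected) candidates ranked by reverse inference."""
    # Sampling: draw several candidate utterances for the same text and prompt.
    candidates = [synthesize(text, prompt) for _ in range(n_samples)]

    # Automatic annotating: use each candidate as the prompt for a new
    # generation and score the result. A good candidate should work well
    # as a prompt, so this score is credited to the candidate itself.
    scores = [score(synthesize(reverse_text, cand), reverse_text) for cand in candidates]

    # Rank by index so the speech objects themselves are never compared.
    best = max(range(n_samples), key=scores.__getitem__)
    worst = min(range(n_samples), key=scores.__getitem__)
    return candidates[best], candidates[worst]
```

The returned pair could then feed the learning stage as a preferred and a rejected example; how the paper actually performs that optimization is not stated in the excerpt above, so it is left out of the sketch.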
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
