Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang

TL;DR
This paper introduces UNO, a novel framework that integrates human subjective feedback directly into TTS training, significantly enhancing zero-shot speech quality and style adaptation without needing a reward model.
Contribution
The paper proposes a new uncertainty-aware optimization method that incorporates human feedback directly into TTS training, improving zero-shot performance and style adaptation.
Findings
Significant improvement in MOS and speaker similarity.
Enhanced zero-shot TTS performance on unseen speakers.
Flexible adaptation to emotional speaking styles.
Abstract
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, although human subjective evaluations, such as the mean opinion score (MOS), remain the gold standard for assessing the quality of synthetic speech, even state-of-the-art TTS approaches have kept human feedback isolated from training, which has resulted in mismatched training objectives and evaluation metrics. In this work, we investigate a novel topic of integrating subjective human evaluation into the TTS training loop. Inspired by the recent success of reinforcement learning from human feedback, we propose a comprehensive sampling-annotating-learning framework tailored to TTS optimization, namely uncertainty-aware optimization (UNO). Specifically,…
Peer Reviews
Decision·Submitted to ICLR 2025
- The paper is well-written, presenting a clear logical flow and strong motivation, supported by sufficient background knowledge. The authors effectively present the rationale behind their solution and provide intuitive explanations for the mathematical formulation, which might initially seem complex. The differences and improvements of the proposed method compared to previous work are also clearly outlined.
- The UNO framework consistently shows advantages when applied to various TTS models.
On page 5, the definitions of EDL and I-CNF are clear but somewhat overwhelming. I recommend a minor restructuring of these sections to enhance readability and make the content easier to follow.
**Integration of Human Feedback** This paper introduces an interesting approach by incorporating subjective human evaluation directly into the TTS training process through the Uncertainty-Aware Optimization (UNO) method. This aims to address a key challenge in aligning TTS models with human preferences and eliminates the need for complex reward models or preference data.
**Flexible and Adaptable Method** The UNO method is presented as adaptable to different speaking styles, including emotional
**Limited Novelty of the Proposed Approach** The UNO method presents an algorithm that closely resembles the KTO algorithm, which was introduced earlier on February 2, 2024. While UNO incorporates uncertainty into its framework, the results in Section 6.3 indicate that it achieves comparable Mean Opinion Scores (MOS) to the UNO-null variant (which does not have uncertainty), with scores of 4.31±0.66 and 4.24±0.59, respectively. This suggests that the inclusion of uncertainty does not significant
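The reviewer's comparison of the two MOS figures can be checked with a few lines of arithmetic. The sketch below assumes that the number after each score is a symmetric interval half-width (the review excerpt does not say whether it is a confidence interval or a standard deviation); the helper names `interval` and `overlap` are hypothetical and not from the paper.

```python
# Hedged sketch: compare the two MOS figures quoted in the review.
# Assumption: each second number is a symmetric half-width around the mean.

def interval(mean, half_width):
    """Return (low, high) for a symmetric interval around the mean."""
    return (mean - half_width, mean + half_width)

def overlap(a, b):
    """Length of the overlap between two intervals, or 0.0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

uno = interval(4.31, 0.66)       # UNO, as quoted in the review
uno_null = interval(4.24, 0.59)  # UNO-null (no uncertainty), as quoted

mean_gap = 4.31 - 4.24
shared = overlap(uno, uno_null)

print("UNO interval:     ", tuple(round(x, 2) for x in uno))
print("UNO-null interval:", tuple(round(x, 2) for x in uno_null))
print("gap between means:", round(mean_gap, 2))
print("interval overlap: ", round(shared, 2))
```

Under that assumption, the gap between the means is far smaller than either half-width and the two intervals overlap almost entirely, which is the basis for the reviewer's concern. An overlap check of this kind is only a rough heuristic and does not replace a proper significance test on the underlying ratings.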
1. This work addresses an important problem by incorporating human feedback into TTS model training, and it is theoretically well-founded.
2. The authors present extensive evaluations and experiments, providing a thorough analysis of the proposed framework.
3. The framework appears to be a generalizable pipeline, though additional analyses and experiments would further strengthen the work.
1. In the simulation and annotation steps, the simulators I-CNF and EDL rely on systems that are already well-established in prior research, limiting the novelty of the annotation procedure.
2. The equations in the paper could benefit from better formatting, as double references, such as "Eqn. equation," detract from readability.
3. Many important points and comparisons are located in the appendix. These should be moved to the main sections to improve the readability. The writing should be impro
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Robotics and Automated Systems
