Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech

Jixun Yao; Yuguang Yang; Yu Pan; Yuan Feng; Ziqian Ning; Jianhao Ye; Hongbin Zhou; Lei Xie

arXiv:2502.02950 · eess.AS · December 29, 2025

TL;DR

This paper introduces a fine-grained preference optimization method that improves zero-shot text-to-speech systems by targeting localized issues, leading to better robustness, reduced errors, and higher data efficiency.

Contribution

It proposes a novel fine-grained preference optimization approach that focuses on local issues in TTS outputs, enhancing robustness and data efficiency over existing methods.

Findings

01. Reduces the bad case ratio in generated speech.

02. Improves intelligibility of TTS outputs.

03. Achieves similar performance with fewer training samples.

Abstract

Integrating human feedback to align text-to-speech (TTS) system outputs with human preferences has proven to be an effective approach for enhancing the robustness of language model-based TTS systems. Current approaches primarily focus on using preference data annotated at the utterance level. However, frequent issues that affect the listening experience often only arise in specific segments of audio samples, while other segments are well-generated. In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems. FPO focuses on addressing localized issues in generated samples rather than uniformly optimizing the entire utterance. Specifically, we first analyze the types of issues in generated samples, categorize them into two groups, and propose a selective training loss strategy to optimize preferences based on fine-grained labels…
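
The abstract is cut off before it defines the selective training loss, so the following is only a minimal sketch of one way such a loss could look, not the authors' formulation. It assumes a DPO-style objective in which a per-token 0/1 mask (standing in for the fine-grained labels) decides which tokens contribute to the preference margin. The function names, the `(policy_logps, ref_logps, mask)` triple layout, and the `beta` default are all hypothetical.

```python
import math


def masked_log_ratio(policy_logps, ref_logps, mask):
    """Sum (policy - reference) log-probabilities over tokens whose mask is 1.

    All three arguments are per-token sequences of equal length. The mask is an
    assumed stand-in for fine-grained labels marking the problematic segment.
    """
    if not (len(policy_logps) == len(ref_logps) == len(mask)):
        raise ValueError("per-token inputs must have equal length")
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def selective_preference_loss(chosen, rejected, beta=0.1):
    """DPO-style loss restricted to flagged tokens (illustrative assumption).

    chosen / rejected are (policy_logps, ref_logps, mask) triples for the
    preferred and dispreferred samples. Tokens with mask 0 are ignored.
    """
    margin = beta * (masked_log_ratio(*chosen) - masked_log_ratio(*rejected))
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: only the first token of each sample is flagged, so the second token's
# log-probabilities have no effect on the loss.
chosen = ([-0.5, -3.0], [-1.0, -1.0], [1, 0])
rejected = ([-2.0, -1.0], [-1.0, -1.0], [1, 0])
loss = selective_preference_loss(chosen, rejected)
```

With an all-zero mask the margin is zero and the loss equals log 2, which means well-generated segments contribute no gradient signal under this assumed formulation.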

Taxonomy

Topics: Speech and dialogue systems

Methods: Focus · ALIGN