Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment

Xueyao Zhang; Yuancheng Wang; Chaoren Wang; Ziniu Li; Zhuo Chen; Zhizheng Wu

arXiv:2505.04113 · cs.SD · June 9, 2025

TL;DR

This paper improves the intelligibility of zero-shot text-to-speech in challenging scenarios (tongue twisters, repeated words, code-switching, and cross-lingual synthesis) by applying preference alignment with a new dataset. The aligned TTS models also improve in naturalness, similarity, and audio quality.

Contribution

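The paper introduces the Intelligibility Preference Speech Dataset (INTP) and extends the Direct Preference Optimization (DPO) framework to accommodate diverse TTS architectures, improving zero-shot TTS performance across diverse domains.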

Findings

01 Enhanced intelligibility in challenging TTS scenarios.

02 Improved naturalness, similarity, and audio quality.

03 Demonstrated generalization to new TTS models.

Abstract

Modern zero-shot text-to-speech (TTS) systems, despite using extensive pre-training, often struggle in challenging scenarios such as tongue twisters, repeated words, code-switching, and cross-lingual synthesis, leading to intelligibility issues. To address these limitations, this paper leverages preference alignment techniques, which enable targeted construction of out-of-pretraining-distribution data to enhance performance. We introduce a new dataset, named the Intelligibility Preference Speech Dataset (INTP), and extend the Direct Preference Optimization (DPO) framework to accommodate diverse TTS architectures. After INTP alignment, in addition to intelligibility, we observe overall improvements including naturalness, similarity, and audio quality for multiple TTS models across diverse domains. Based on that, we also verify the weak-to-strong generalization ability of INTP for more…
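
The abstract refers to the Direct Preference Optimization (DPO) framework. As a rough illustration, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It is not the paper's extended formulation: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions chosen for illustration, and the log-probabilities are taken as given scalars rather than computed from a TTS model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. All names are hypothetical.
    """
    # How much more the policy favors each sample than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the chosen (here, more intelligible) sample than to the rejected one.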

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Language Development and Disorders