StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Yinghao Aaron Li, Xilin Jiang, Cong Han, and Nima Mesgarani

arXiv:2409.10058 · eess.AS · September 17, 2024

TL;DR

StyleTTS-ZS introduces an efficient zero-shot TTS system that uses style diffusion and distillation to produce natural, high-similarity speech with significantly faster inference, addressing speed and naturalness issues of prior models.

Contribution

The paper presents a time-varying style diffusion approach with distillation for zero-shot TTS, improving on existing models in both speech quality and inference speed.

Findings

01

Cuts inference time by about 90% by distilling the style diffusion model.

02

Surpasses prior state-of-the-art zero-shot TTS models in naturalness and speaker similarity.

03

Maintains speech quality when the style diffusion model is distilled with only 10k samples.
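
The 90% figure in the first finding can be read in more than one way. Assuming it refers to a 90% reduction in inference time (an interpretation, not something this page states explicitly), the equivalent speedup multiple is 1 / (1 - 0.9), roughly 10x. The helper names below are hypothetical and only illustrate the arithmetic:

```python
def speedup_factor(reduction: float) -> float:
    """Convert a fractional reduction in run time into a speedup multiple.

    Assumption: `reduction` is the fraction of the original run time that
    was removed, e.g. 0.9 for "90% less time".
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def time_reduction(speedup: float) -> float:
    """Inverse mapping: a speedup multiple back to a fractional time reduction."""
    if speedup < 1.0:
        raise ValueError("speedup must be >= 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # A 90% cut in run time corresponds to roughly a tenfold speedup.
    print(f"{speedup_factor(0.9):.1f}x")
    # A 20x speedup would correspond to a 95% cut in run time.
    print(f"{time_reduction(20.0):.0%}")
```

Under this reading, each additional percentage point of time saved near the top of the range matters a great deal: going from a 90% to a 95% reduction doubles the speedup multiple.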

Abstract

The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings