NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Kai Shen; Zeqian Ju; Xu Tan; Yanqing Liu; Yichong Leng; Lei He; Tao Qin; Sheng Zhao; Jiang Bian

arXiv:2304.09116·eess.AS·May 31, 2023·37 cites

TL;DR

NaturalSpeech 2 is a TTS system that uses a diffusion model, conditioned on text, to generate the latent vectors of a neural audio codec instead of predicting discrete tokens one by one. A speech prompting mechanism enables zero-shot speech and singing synthesis, and the system is trained on large-scale, multi-speaker data.

Contribution

It presents a novel diffusion-based TTS framework with a speech prompting mechanism for zero-shot speech and singing synthesis, scaling to 44K hours of data.

Findings

01 Outperforms previous TTS systems in prosody and voice quality

02 Achieves high-quality zero-shot singing synthesis with only a speech prompt

03 Demonstrates robustness and diversity in large-scale datasets

Abstract

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We…
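The abstract mentions a codec with residual vector quantizers. As a rough illustration of that general idea only (not the paper's codec), the sketch below quantizes a vector with a stack of codebooks, where each stage encodes the residual left by the previous one. The function names, the toy codebooks, and the input vector are all hypothetical and chosen for readability.

```python
# Generic residual vector quantization (RVQ) sketch in pure Python.
# Assumption: toy 2-D codebooks; a real codec learns its codebooks from data.

def nearest(codebook, v):
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(c, v))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage codes what earlier stages left over."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the quantized vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

# Hypothetical two-stage example: a coarse codebook followed by a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]
indices = rvq_encode([1.4, 1.0], codebooks)
quantized = rvq_decode(indices, codebooks)
```

In this toy setup the summed codewords form a continuous vector that approximates the input. According to the abstract, the diffusion model generates the codec's latent vectors conditioned on text, rather than predicting token indices autoregressively.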

Code & Models

Models

macminix/ChatML (model, 24 downloads)

Datasets

Wenetspeech4TTS/WenetSpeech4TTS (dataset, 1.0k downloads)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling

Methods: Diffusion