NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Zeqian Ju; Yuancheng Wang; Kai Shen; Xu Tan; Detai Xin; Dongchao Yang; Yanqing Liu; Yichong Leng; Kaitao Song; Siliang Tang; Zhizheng Wu; Tao Qin; Xiang-Yang Li; Wei Ye; Shikun Zhang; Jiang Bian; Lei He; Jinyu Li; Sheng Zhao

arXiv:2403.03100·eess.AS·April 24, 2024·20 cites

Open Access · 1 Repo · 3 Models

TL;DR

NaturalSpeech 3 is a TTS system built on factorized diffusion models. It disentangles speech into attribute subspaces (content, prosody, timbre, and acoustic details) and generates each one separately, enabling zero-shot synthesis that outperforms state-of-the-art systems in quality, similarity, and prosody.

Contribution

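The paper proposes a neural codec with factorized vector quantization (FVQ) that disentangles speech into attribute subspaces, together with a factorized diffusion model that generates each attribute from its corresponding prompt, improving the quality and flexibility of zero-shot speech synthesis.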

Findings

01 Outperforms state-of-the-art TTS systems in quality, similarity, and prosody.

02 Achieves on-par quality with human recordings.

03 Scaling to 1B parameters and 200K hours of data further improves performance.

Abstract

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With…
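To make the factorization idea in the abstract concrete, here is a minimal sketch of vector quantization with a separate codebook per attribute subspace. It is not the paper's codec: the four attribute names are taken from the abstract, while the function names, codebook sizes, dimensions, and random codebooks are hypothetical, and treating every attribute identically is a simplification that may not match how the paper handles each one.

```python
import numpy as np

# Attribute names come from the abstract; everything else is illustrative.
ATTRIBUTES = ("content", "prosody", "timbre", "acoustic_details")


def make_codebooks(num_codes, dim, seed=0):
    """Create one random codebook per attribute (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {name: rng.normal(size=(num_codes, dim)) for name in ATTRIBUTES}


def quantize(latents, codebook):
    """Nearest-neighbour vector quantization.

    latents:  (frames, dim) array of continuous vectors
    codebook: (num_codes, dim) array of code vectors
    Returns (indices, quantized), where quantized[i] == codebook[indices[i]].
    """
    # Squared Euclidean distance between every frame and every code vector.
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


def factorized_quantize(latents_by_attr, codebooks):
    """Quantize each attribute's latents with that attribute's own codebook."""
    return {
        name: quantize(latents_by_attr[name], codebooks[name])
        for name in ATTRIBUTES
    }


# Toy usage: random vectors stand in for the output of a learned encoder.
codebooks = make_codebooks(num_codes=8, dim=4)
rng = np.random.default_rng(1)
latents = {name: rng.normal(size=(5, 4)) for name in ATTRIBUTES}
tokens = factorized_quantize(latents, codebooks)
for name in ATTRIBUTES:
    indices, quantized = tokens[name]
    print(name, indices.shape, quantized.shape)
```

Each attribute ends up with its own sequence of discrete token indices, which is the kind of per-subspace representation a downstream generator could model separately. In a trained codec the codebooks and encoder would be learned jointly; the random values here only exercise the lookup.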

Code & Models

Repositories

lifeiteng/naturalspeech3_facodec
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing

Methods: Diffusion