WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Nanxin Chen; Yu Zhang; Heiga Zen; Ron J. Weiss; Mohammad Norouzi; Najim Dehak; William Chan

arXiv:2106.09660·eess.AS·June 22, 2021·5 cites


Open Access · 3 Repos · 1 Dataset

TL;DR

WaveGrad 2 introduces an iterative, non-autoregressive approach to text-to-speech synthesis that generates high-quality audio by refining noise through multiple steps, offering a flexible trade-off between speed and quality.

Contribution

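It presents WaveGrad 2, an iterative refinement model for TTS. Unlike the original WaveGrad vocoder, which conditions on mel-spectrogram features from a separate model, WaveGrad 2 generates the waveform directly from a phoneme sequence, with flexible inference and high-fidelity audio.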

Findings

01. High-fidelity audio approaching state-of-the-art quality

02. Flexible inference speed via an adjustable number of refinement steps

03. Ablation studies on model configurations

Abstract

This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence and, through an iterative refinement process, generates an audio waveform. This contrasts with the original WaveGrad vocoder, which conditions on mel-spectrogram features generated by a separate model. The iterative refinement process starts from Gaussian noise and, through a series of refinement steps (e.g., 50 steps), progressively recovers the audio sequence. WaveGrad 2 offers a natural way to trade off inference speed against sample quality by adjusting the number of refinement steps. Experiments show that the model can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS…
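The refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard DDPM-style ancestral sampler, an illustrative linear noise schedule, and a hypothetical `predict_noise` callable standing in for the trained network (which in WaveGrad 2 would also condition on the phoneme sequence). The names `make_schedule`, `refine`, and `oracle` are invented for this example. To keep the sketch runnable without a trained model, a toy "oracle" predictor that already knows the target signal is used.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    # Illustrative linear noise schedule (an assumption, not the paper's).
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def refine(predict_noise, length, num_steps=50, seed=0):
    """Start from Gaussian noise and refine it over num_steps iterations."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    y = rng.standard_normal(length)  # initial sample: pure Gaussian noise
    for n in range(num_steps - 1, -1, -1):
        eps = predict_noise(y, alpha_bars[n])  # estimated noise in y
        # Remove the predicted noise component (posterior mean).
        y = (y - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:
            # Re-inject a smaller amount of noise on all but the last step.
            sigma = np.sqrt((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n])
            y = y + sigma * rng.standard_normal(length)
    return y

# Toy stand-in for the trained network: it "knows" the clean target signal.
target = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))

def oracle(y, alpha_bar):
    return (y - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

few = refine(oracle, target.size, num_steps=6)    # faster, fewer refinements
many = refine(oracle, target.size, num_steps=50)  # slower, more refinements
```

The `num_steps` argument is the knob behind the speed/quality trade-off: fewer steps mean fewer network evaluations. With the perfect oracle used here, both settings recover the target, so the sketch shows only the mechanics of the loop; with a learned, imperfect predictor the number of steps would be expected to affect sample quality.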

Code & Models

Datasets

purdueviperlab/diffssd · dataset · 37 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Residual Connection · 1x1 Convolution · WaveGrad UBlock · FiLM Module · Convolution · WaveGrad DBlock · WaveGrad