WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

TL;DR
WaveGrad 2 introduces an iterative, non-autoregressive approach to text-to-speech synthesis that generates high-quality audio by refining noise through multiple steps, offering a flexible trade-off between speed and quality.
Contribution
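It presents WaveGrad 2, an iterative refinement model for TTS that generates the waveform directly from a phoneme sequence, rather than from mel-spectrogram features produced by a separate model as the original WaveGrad vocoder does. The number of refinement steps can be adjusted at inference time.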
Findings
High-fidelity audio approaching the performance of a state-of-the-art neural TTS system
Adjustable inference speed: fewer refinement steps run faster, more steps improve sample quality
Ablation studies across model configurations
Abstract
This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform. This contrasts with the original WaveGrad vocoder, which conditions on mel-spectrogram features generated by a separate model. The iterative refinement process starts from Gaussian noise, and through a series of refinement steps (e.g., 50 steps), progressively recovers the audio sequence. WaveGrad 2 offers a natural way to trade off inference speed against sample quality by adjusting the number of refinement steps. Experiments show that the model can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS…
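The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a standard denoising-diffusion style update, a hypothetical linear noise schedule (linear_noise_schedule), and a placeholder dummy_noise_predictor standing in for the trained network. The output length num_samples is passed in by hand here purely for simplicity. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def linear_noise_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Hypothetical linear schedule of per-step noise variances (an assumption)."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def dummy_noise_predictor(y, phonemes, noise_level):
    """Placeholder for the trained network: a real model would predict the
    noise in y conditioned on the phoneme sequence. Returns zeros here."""
    return [0.0] * len(y)


def iterative_refinement(phonemes, num_samples, num_steps,
                         predict_noise=dummy_noise_predictor, seed=0):
    """Start from Gaussian noise and refine it over num_steps steps."""
    rng = random.Random(seed)
    betas = linear_noise_schedule(num_steps)
    alphas = [1.0 - b for b in betas]

    # Cumulative products of alphas, used to express the noise level per step.
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    # Initial waveform estimate: pure Gaussian noise.
    y = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

    for n in reversed(range(num_steps)):
        eps = predict_noise(y, phonemes, math.sqrt(alpha_bars[n]))
        coef = betas[n] / math.sqrt(1.0 - alpha_bars[n])
        # Remove the predicted noise component and rescale.
        y = [(yi - coef * ei) / math.sqrt(alphas[n]) for yi, ei in zip(y, eps)]
        if n > 0:
            # Re-inject a smaller amount of noise on every step but the last.
            sigma = math.sqrt(
                (1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n]
            )
            y = [yi + sigma * rng.gauss(0.0, 1.0) for yi in y]
    return y


# Fewer steps run faster; more steps give the model more chances to refine.
fast = iterative_refinement(["HH", "AH"], num_samples=16, num_steps=6)
slow = iterative_refinement(["HH", "AH"], num_samples=16, num_steps=50)
```

Because the number of network evaluations equals num_steps, changing that one argument is what moves the model along the speed versus quality trade-off mentioned in the abstract.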
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Residual Connection · 1x1 Convolution · WaveGrad UBlock · FiLM Module · Convolution · WaveGrad DBlock · WaveGrad
