Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen

TL;DR
This paper introduces a parallel feed-forward network for speech synthesis that matches WaveNet's fidelity while generating audio more than 20 times faster than real time, which makes it practical to deploy in production systems such as Google Assistant.
Contribution
It presents Probability Density Distillation, a training method that uses a trained WaveNet to train a fast, parallel feed-forward network with no significant difference in quality.
Findings
Generates high-fidelity speech more than 20 times faster than real time
Shows no significant difference in quality from the original WaveNet
Deployed in Google Assistant, serving multiple English and Japanese voices
Abstract
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples at more than 20 times faster than real-time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.
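The contrast the abstract draws between sequential and parallel generation can be illustrated with a toy example. The sketch below is not the paper's network: it assumes a hypothetical linear autoregressive process with a made-up coefficient `a`, chosen only because its parallel form is easy to write down. The function names `sample_sequential` and `sample_parallel` are illustrative. The first function must produce samples one at a time because each depends on the previous output; the second computes every output directly from the input noise, so no time step waits on another.

```python
def sample_sequential(noise, a):
    """Autoregressive generation: x[t] = a * x[t-1] + noise[t].

    Each output needs the previous output, so the loop cannot be
    parallelized across time steps.
    """
    out = []
    prev = 0.0
    for z in noise:
        prev = a * prev + z
        out.append(prev)
    return out


def sample_parallel(noise, a):
    """Feed-forward generation of the same signal.

    Each x[t] is written as a function of the noise alone:
    x[t] = sum_k a**(t-k) * noise[k] for k <= t.
    No output depends on another output, so all time steps could be
    computed simultaneously on parallel hardware.
    """
    n = len(noise)
    return [
        sum(a ** (t - k) * noise[k] for k in range(t + 1))
        for t in range(n)
    ]


if __name__ == "__main__":
    noise = [1.0, 2.0, 3.0]   # toy input noise (assumed values)
    a = 0.5                   # toy coefficient (assumed value)
    print(sample_sequential(noise, a))
    print(sample_parallel(noise, a))
```

Both functions produce the same signal from the same noise. In this toy case the parallel form does more arithmetic in total, but none of it is sequentially dependent, which is the property that suits massively parallel hardware.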
Videos
DeepMind's WaveNet, 1000 Times Faster | Two Minute Papers #232 · YouTube
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet
