Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Aaron van den Oord; Yazhe Li; Igor Babuschkin; Karen Simonyan; Oriol Vinyals; Koray Kavukcuoglu; George van den Driessche; Edward Lockhart; Luis C. Cobo; Florian Stimberg; Norman Casagrande; Dominik Grewe; Seb Noury; Sander Dieleman; Erich Elsen; Nal Kalchbrenner; Heiga Zen; Alex Graves; Helen King; Tom Walters; Dan Belov; Demis Hassabis

arXiv:1711.10433 · cs.LG · November 29, 2017 · 343 cites

TL;DR

This paper introduces a parallel feed-forward network for speech synthesis that matches WaveNet's quality while generating audio more than 20 times faster than real time, making it practical for production systems such as Google Assistant.

Contribution

It presents Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality.

Findings

01 Generates speech more than 20 times faster than real time

02 Maintains speech quality with no significant difference from WaveNet

03 Deployed in Google Assistant, serving multiple English and Japanese voices
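
To make the "20 times faster than real time" figure concrete, the sketch below computes a real-time factor: seconds of audio produced per second of wall-clock time. The function names, the 24 kHz sample rate, and the timing numbers are all illustrative assumptions, not values taken from this page.

```python
def real_time_factor(num_samples, sample_rate_hz, wall_seconds):
    """Seconds of audio generated per second of wall-clock time."""
    audio_seconds = num_samples / sample_rate_hz
    return audio_seconds / wall_seconds


def samples_per_second(num_samples, wall_seconds):
    """Raw throughput the synthesizer must sustain."""
    return num_samples / wall_seconds


# Hypothetical measurement: 3 s of 24 kHz audio generated in 0.125 s.
NUM_SAMPLES = 72_000      # assumed: 3 s * 24,000 samples/s
SAMPLE_RATE_HZ = 24_000   # assumed sample rate, for illustration only
WALL_SECONDS = 0.125      # assumed wall-clock generation time

rtf = real_time_factor(NUM_SAMPLES, SAMPLE_RATE_HZ, WALL_SECONDS)
throughput = samples_per_second(NUM_SAMPLES, WALL_SECONDS)
print(rtf)         # 24.0
print(throughput)  # 576000.0
```

A factor above 1 means the system keeps up with playback; under these assumed numbers, a factor above 20 at 24 kHz would require producing more than 480,000 samples per second.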

Abstract

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples at more than 20 times faster than real-time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.
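
The contrast the abstract draws between sequential and parallel generation can be sketched with a toy example. This is not the paper's architecture: the two samplers, their `coeff` and `context` parameters, and the arithmetic inside them are hypothetical stand-ins chosen only to show the dependency structure. In the sequential sampler each output needs the previous output, so the loop must run in order. In the feed-forward sampler each output depends only on the input noise, so timesteps can be computed in any order (and therefore in parallel).

```python
import math
import random


def sequential_generate(noise, coeff=0.9):
    """Toy autoregressive sampler: sample t needs the *output* at t-1."""
    samples = []
    prev = 0.0
    for z in noise:  # order matters: each step consumes the previous output
        prev = math.tanh(coeff * prev + z)
        samples.append(prev)
    return samples


def parallel_sample(noise, t, coeff=0.9, context=3):
    """Toy feed-forward sampler: sample t depends only on input noise."""
    window = noise[max(0, t - context):t]  # noise strictly before t
    shift = coeff * sum(window) / context
    scale = 1.0 / (1.0 + sum(abs(z) for z in window))
    return math.tanh(noise[t] * scale + shift)


def parallel_generate(noise, order=None):
    """Compute every timestep independently, in any order."""
    order = list(range(len(noise))) if order is None else order
    out = [None] * len(noise)
    for t in order:  # iterations share no state, so they could run concurrently
        out[t] = parallel_sample(noise, t)
    return out


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]

forward = parallel_generate(noise)
shuffled = parallel_generate(noise, order=[5, 0, 7, 2, 6, 1, 4, 3])
print(forward == shuffled)  # True: evaluation order does not change the result
```

The sequential loop has no equivalent trick: reordering it would change which previous output each step sees, which is why one-sample-at-a-time generation maps poorly onto parallel hardware.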

Code & Models

Videos

DeepMind's WaveNet, 1000 Times Faster | Two Minute Papers #232 · YouTube

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet