Efficient Neural Audio Synthesis

Nal Kalchbrenner; Erich Elsen; Karen Simonyan; Seb Noury; Norman Casagrande; Edward Lockhart; Florian Stimberg; Aaron van den Oord; Sander Dieleman; Koray Kavukcuoglu

arXiv:1802.08435 · cs.SD · June 27, 2018 · 75 cites

Open Access 5 Repos 1 Models

TL;DR

This paper introduces WaveRNN, a compact and efficient neural network for high-quality audio synthesis that significantly reduces sampling time, employs weight pruning for sparsity, and uses subscaling for parallel sample generation.

Contribution

The paper presents WaveRNN, a single-layer recurrent network with a dual softmax layer; applies weight pruning to obtain sparse networks; and introduces subscaling for parallel sample generation.
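The dual softmax can be pictured with a small sketch. It assumes the split described in the paper: each 16-bit sample is divided into a coarse part (the 8 most significant bits) and a fine part (the 8 least significant bits), and each part gets its own 256-way softmax. Only the bit split and its inverse are shown here; the helper names `split_sample` and `join_sample` are hypothetical and do not come from the authors' code.

```python
from typing import Tuple


def split_sample(sample: int) -> Tuple[int, int]:
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit parts.

    Hypothetical helper: illustrates the targets of the two softmaxes,
    not the network itself.
    """
    if not 0 <= sample <= 0xFFFF:
        raise ValueError("expected an unsigned 16-bit value")
    coarse = sample >> 8   # 8 most significant bits
    fine = sample & 0xFF   # 8 least significant bits
    return coarse, fine


def join_sample(coarse: int, fine: int) -> int:
    """Recombine the two 8-bit parts into the original 16-bit sample."""
    return (coarse << 8) | fine


coarse, fine = split_sample(0xABCD)
restored = join_sample(coarse, fine)
```

Two 256-way outputs need 512 output units in total, yet together they address the same 65,536 values that a single softmax over the full 16-bit range would.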

Findings

01

WaveRNN matches WaveNet quality while generating 24kHz 16-bit audio 4x faster than real time on a GPU.

02

For a fixed number of parameters, large sparse WaveRNNs outperform small dense ones.

03

Subscale WaveRNN enables parallel sample generation without quality loss.
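To make the pruning finding concrete, here is a minimal sketch of magnitude-based weight pruning. It is a one-shot, unstructured variant chosen for illustration; this summary does not give the paper's pruning schedule or sparsity pattern, so both are assumptions, as is the function name `magnitude_prune`.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `weights` is a list of equal-length rows; `sparsity` is the fraction
    of entries to zero (0.0 to 1.0). Illustrative only: one-shot and
    unstructured, which may differ from the scheme used in the paper.
    """
    rows, cols = len(weights), len(weights[0])
    # Rank every position by absolute value, smallest first.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
    )
    k = int(rows * cols * sparsity)  # number of entries to remove
    pruned = [list(row) for row in weights]
    for r, c in order[:k]:
        pruned[r][c] = 0.0
    return pruned


dense = [
    [0.9, -0.05, 0.3, 0.01],
    [-0.7, 0.02, -0.4, 0.6],
]
sparse = magnitude_prune(dense, sparsity=0.5)
nonzero = sum(1 for row in sparse for w in row if w != 0.0)
```

With `sparsity=0.5`, half of the eight weights are zeroed, so a matrix twice as large could be stored in the same nonzero budget as a dense matrix of the original size. That is the sense in which a large sparse network and a small dense one can share a parameter count.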

Abstract

Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and…
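The "4x faster than real time" figure from the abstract translates into a per-sample time budget. The short calculation below uses only the two numbers stated there (24kHz audio, 4x real time); the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 24_000   # 24kHz audio, from the abstract
REALTIME_FACTOR = 4       # "4x faster than real time", from the abstract

# Samples the model must emit per wall-clock second.
samples_per_second = SAMPLE_RATE_HZ * REALTIME_FACTOR

# Average time available per sample, in microseconds.
budget_us = 1e6 / samples_per_second

print(f"{samples_per_second} samples/s, about {budget_us:.1f} microseconds per sample")
```

Because samples are generated one after another, that budget of roughly 10 microseconds is the average time available for each step of the network.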

Code & Models

Models

Nevertree/RTVC (model)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Pruning · Sigmoid Activation · Tanh Activation · WaveRNN · Softmax