High Fidelity Speech Synthesis with Adversarial Networks
Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan

TL;DR
This paper introduces GAN-TTS, a generative adversarial network for text-to-speech synthesis. Its feed-forward generator produces high-fidelity, natural-sounding speech with a high degree of parallelism, at a quality comparable to state-of-the-art autoregressive models.
Contribution
The paper presents GAN-TTS, a new GAN-based architecture for text-to-speech that achieves high-quality speech synthesis with efficient parallel generation and novel evaluation metrics.
Findings
GAN-TTS generates speech with naturalness comparable to state-of-the-art models.
It is highly parallelisable due to a feed-forward generator.
New quantitative metrics correlate well with human perception.
Abstract
Generative adversarial networks have seen rapid development in recent years and have led to remarkable improvements in generative modelling of images. However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech. To address this paucity, we introduce GAN-TTS, a Generative Adversarial Network for Text-to-Speech. Our architecture is composed of a conditional feed-forward generator producing raw speech audio, and an ensemble of discriminators which operate on random windows of different sizes. The discriminators analyse the audio both in terms of general realism, as well as how well the audio corresponds to the utterance that should be pronounced. To measure the performance of GAN-TTS, we employ both subjective human evaluation (MOS - Mean…
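The abstract mentions discriminators that operate on random windows of different sizes. As a rough illustration of that idea only, the sketch below crops one randomly positioned window per size from a waveform. The function name, the window sizes, and the 24 kHz toy waveform are all assumptions made for this example; none of them come from the text above, and this is not the authors' implementation.

```python
import random


def sample_random_windows(waveform, window_sizes, rng=None):
    """Return one randomly positioned crop of the waveform per window size.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    rng = rng or random.Random()
    windows = {}
    for size in window_sizes:
        if size > len(waveform):
            raise ValueError(
                f"window size {size} exceeds waveform length {len(waveform)}"
            )
        # Pick a start index so the whole window fits inside the waveform.
        start = rng.randrange(len(waveform) - size + 1)
        windows[size] = waveform[start:start + size]
    return windows


# Toy waveform: one second of placeholder samples at an assumed 24 kHz rate.
waveform = [0.0] * 24000
window_sizes = [240, 480, 960]  # illustrative sizes, chosen for this sketch

windows = sample_random_windows(waveform, window_sizes, rng=random.Random(0))
for size, window in windows.items():
    print(size, len(window))
```

In a setup like this, each window would be passed to a discriminator built for that window length, so that short windows capture fine local detail and longer ones capture more context.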
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Mixture of Logistic Distributions · Dense Connections · Batch Normalization · Feedforward Network · Tanh Activation · Off-Diagonal Orthogonal Regularization · Spectral Normalization · Conditional Batch Normalization · Average Pooling
