MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Kundan Kumar; Rithesh Kumar; Thibault de Boissiere; Lucas Gestin; Wei Zhen Teoh; Jose Sotelo; Alexandre de Brebisson; Yoshua Bengio; Aaron Courville

arXiv:1910.06711·eess.AS·December 10, 2019·598 cites

TL;DR

MelGAN is a non-autoregressive, fully convolutional GAN that generates high-quality, coherent audio waveforms for tasks such as speech and music synthesis, and runs substantially faster than previous models.

Contribution

The paper introduces MelGAN, a GAN whose architectural changes and simple training techniques enable fast, high-quality waveform synthesis across multiple audio domains.

Findings

01 Achieves high Mean Opinion Scores (MOS) in mel-spectrogram inversion

02 Runs over 100x faster than real-time on GPU

03 Generalizes well to unseen speakers and domains

Abstract

Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. Subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive,…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: 1x1 Convolution · GAN Hinge Loss · Weight Normalization · Average Pooling · Dilated Convolution · Grouped Convolution · Residual Connection · Window-based Discriminator · MelGAN Residual Block
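
The GAN hinge loss listed under Methods can be illustrated with a minimal sketch. This is the standard hinge formulation of the adversarial objective, written in plain Python over lists of discriminator scores; the function names and the toy scores are hypothetical and are not taken from the paper's code.

```python
from statistics import fmean


def discriminator_hinge_loss(real_scores, fake_scores):
    """Standard hinge loss for the discriminator (illustrative sketch).

    Penalizes real samples scored below +1 and generated samples
    scored above -1; scores outside those margins contribute zero.
    """
    real_term = fmean([max(0.0, 1.0 - s) for s in real_scores])
    fake_term = fmean([max(0.0, 1.0 + s) for s in fake_scores])
    return real_term + fake_term


def generator_hinge_loss(fake_scores):
    """Standard hinge-GAN generator loss: raise the scores of generated samples."""
    return -fmean(fake_scores)


# Toy discriminator outputs (hypothetical values, for illustration only).
real = [2.0, 0.5]
fake = [-2.0, 0.0]
d_loss = discriminator_hinge_loss(real, fake)
g_loss = generator_hinge_loss(fake)
```

In a real training loop these scores would come from a discriminator network, and the losses would be computed on tensors so that gradients can flow back to the model parameters.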