MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville

TL;DR
MelGAN is a non-autoregressive, fully convolutional GAN that efficiently generates high-quality, coherent audio waveforms for tasks such as speech synthesis and music generation, running much faster than earlier models at comparable quality.
Contribution
The paper introduces MelGAN, a GAN whose architectural changes and simple training techniques enable fast, high-quality waveform synthesis across multiple audio domains.
Findings
Achieves a high Mean Opinion Score (MOS) in mel-spectrogram inversion
Runs over 100x faster than real-time on GPU
Generalizes well to unseen speakers and domains
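As a rough illustration of what "100x faster than real-time" means, the sketch below computes a real-time factor: seconds of audio produced per second of compute. The function name, the 22.05 kHz sample rate, and the timing figure are all hypothetical values chosen for the example, not measurements from the paper.

```python
def real_time_factor(num_samples, sample_rate_hz, synthesis_seconds):
    """Return seconds of audio generated per second of wall-clock compute."""
    audio_seconds = num_samples / sample_rate_hz
    return audio_seconds / synthesis_seconds


# Hypothetical measurement: 10 s of 22.05 kHz audio synthesized in 0.08 s.
rtf = real_time_factor(
    num_samples=220_500, sample_rate_hz=22_050, synthesis_seconds=0.08
)
print(f"{rtf:.0f}x real-time")
```

A factor above 1 means the model generates audio faster than it would take to play it back.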
Abstract
Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. Subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive,…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: 1x1 Convolution · GAN Hinge Loss · Weight Normalization · Average Pooling · Dilated Convolution · Grouped Convolution · Residual Connection · Window-based Discriminator · MelGAN Residual Block
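
The GAN hinge loss listed under Methods can be sketched in a few lines. This is a minimal, framework-free illustration for a single discriminator that outputs one score per example. The function names and the sample scores are hypothetical, and the paper's full objective (several discriminators, plus any auxiliary terms) is not reproduced here.

```python
def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss for the discriminator.

    Pushes scores on real audio above +1 and scores on generated audio
    below -1; examples already past the margin contribute zero loss.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_hinge_loss(fake_scores):
    """Generator loss: raise the discriminator's mean score on generated audio."""
    return -sum(fake_scores) / len(fake_scores)


# Hypothetical discriminator outputs for two real and two generated clips.
d_loss = discriminator_hinge_loss(real_scores=[2.0, 0.5], fake_scores=[-2.0, 0.0])
g_loss = generator_hinge_loss(fake_scores=[-2.0, 0.0])
print(d_loss, g_loss)
```

In a real training loop these would be computed on tensors with an autodiff framework; plain Python lists are used here only to keep the arithmetic visible.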
