Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains
Won Jang, Dan Lim, Jaesam Yoon

TL;DR
Universal MelGAN is a versatile neural vocoder capable of synthesizing high-fidelity speech across multiple domains, including unseen speakers, emotions, and languages, without external domain data.
Contribution
The paper adds multi-resolution spectrogram discriminators to a MelGAN-based neural vocoder, improving spectral detail and robustness when the model is trained on many speakers and applied across multiple domains.
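As a rough illustration of what "multi-resolution spectrogram" means here, the sketch below computes magnitude spectrograms of one waveform under several STFT settings; each would feed a separate spectrogram discriminator. This summary does not state the paper's STFT settings or sample rate, so the `RESOLUTIONS` values, the 24 kHz rate, and the helper names `stft_magnitude` and `multi_resolution_spectrograms` are all assumptions made for the example, and the discriminator networks themselves are omitted.

```python
import numpy as np

def stft_magnitude(wave, n_fft, hop, win):
    """Magnitude spectrogram of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(win)
    n_frames = 1 + (len(wave) - win) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + win] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))

# Hypothetical (n_fft, hop, window) settings; NOT taken from the paper.
RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]

def multi_resolution_spectrograms(wave):
    """One magnitude spectrogram per STFT setting."""
    return [stft_magnitude(wave, n_fft, hop, win) for n_fft, hop, win in RESOLUTIONS]

# Toy input: one second of a 440 Hz sine at an assumed 24 kHz sample rate.
sample_rate = 24000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)

specs = multi_resolution_spectrograms(wave)
for (n_fft, hop, win), spec in zip(RESOLUTIONS, specs):
    print(f"n_fft={n_fft:4d} hop={hop:3d} -> spectrogram shape {spec.shape}")
```

Short windows give fine time resolution and coarse frequency resolution; long windows give the reverse. Judging the generated audio at several such trade-offs at once is the intuition behind using more than one spectrogram discriminator.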
Findings
Achieved the best mean opinion score (MOS) in most scenarios when given ground-truth mel-spectrograms as input
Performed well on unseen speakers, emotions, and languages
Generated high-quality speech from mel-spectrograms predicted by a transformer-based model
Abstract
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains. To preserve sound quality when the MelGAN-based structure is trained with a dataset of hundreds of speakers, we added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms. This enables the model to generate realistic waveforms for multiple speakers by alleviating the over-smoothing problem in the high-frequency band of the large-footprint model. Because the waveform and spectrogram are discriminated only during training, our structure generates signals close to ground-truth data without reducing the inference speed. The model achieved the best mean opinion score (MOS) in most scenarios using ground-truth mel-spectrograms as input. In particular, it showed superior performance in unseen domains with regard to speaker, emotion, and language. Moreover, in a…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: 1x1 Convolution · Dilated Convolution · Residual Connection · Grouped Convolution · GAN Hinge Loss · Weight Normalization · Convolution · Average Pooling · Tanh Activation · MelGAN Residual Block
