VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network
Jinhyeok Yang, Junmo Lee, Youngik Kim, Hoonyoung Cho, Injung Kim

TL;DR
VocGAN is a neural vocoder that synthesizes high-fidelity speech in real time. It uses a hierarchically-nested adversarial network and improves quality over earlier models such as MelGAN and Parallel WaveGAN.
Contribution
It introduces a multi-scale waveform generator and a hierarchically-nested discriminator to enhance speech quality and consistency in real-time vocoding.
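One way to picture the multi-scale idea is that each level of a discriminator hierarchy judges the waveform at a different resolution. The following is a minimal sketch of building such a set of resolutions, assuming average pooling by a factor of 2 between levels. The function names, the pooling factor, and the number of levels are illustrative assumptions and are not details taken from the paper.

```python
def avg_pool_1d(signal, factor=2):
    """Downsample by averaging non-overlapping windows of `factor` samples."""
    usable = len(signal) - len(signal) % factor  # drop any trailing partial window
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


def waveform_pyramid(signal, levels=3):
    """Return copies of `signal` at full, 1/2, 1/4, ... resolution (hypothetical helper)."""
    pyramid = [list(signal)]
    for _ in range(levels):
        pyramid.append(avg_pool_1d(pyramid[-1]))
    return pyramid


# Toy "waveform": a repeating ramp, 64 samples long.
wave = [float(i % 8) for i in range(64)]
pyramid = waveform_pyramid(wave, levels=3)
lengths = [len(level) for level in pyramid]
print(lengths)
```

In a setup like this, a coarse level carries mostly low-frequency structure while the full-resolution level carries fine detail, which is the intuition behind learning several levels of acoustic properties at once.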
Findings
VocGAN synthesizes speech 416.7x faster than real-time on a GTX 1080Ti GPU.
VocGAN outperforms MelGAN in quality metrics, including mean opinion score (MOS).
VocGAN is 6.98x faster than Parallel WaveGAN on CPU with higher MOS.
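The speed figures are easier to compare when converted into wall-clock time per second of audio. The sketch below does that arithmetic using the 416.7x and 6.98x figures above and the 3.24x CPU figure from the abstract. The implied Parallel WaveGAN speed is a derived value, not a reported one, and assumes both CPU numbers were measured on the same hardware and settings.

```python
def seconds_per_audio_second(speedup: float) -> float:
    """Wall-clock seconds needed to synthesize one second of audio."""
    return 1.0 / speedup


GPU_SPEEDUP = 416.7  # VocGAN vs. real time on a GTX 1080Ti GPU
CPU_SPEEDUP = 3.24   # VocGAN vs. real time on a CPU (from the abstract)
PWG_RATIO = 6.98     # VocGAN vs. Parallel WaveGAN on a CPU

gpu_ms = seconds_per_audio_second(GPU_SPEEDUP) * 1000
cpu_ms = seconds_per_audio_second(CPU_SPEEDUP) * 1000

# Implied Parallel WaveGAN speed relative to real time (derived, assumes same CPU).
pwg_speedup = CPU_SPEEDUP / PWG_RATIO

print(f"VocGAN on GPU: {gpu_ms:.2f} ms per second of audio")
print(f"VocGAN on CPU: {cpu_ms:.2f} ms per second of audio")
print(f"Implied Parallel WaveGAN CPU speed: {pwg_speedup:.2f}x real time")
```

Under these assumptions, VocGAN needs roughly 2.4 ms on the GPU and about 309 ms on the CPU per second of audio, while the implied Parallel WaveGAN figure of about 0.46x would be slower than real time on that CPU.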
Abstract
We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time. However, it often produces a waveform that is insufficient in quality or inconsistent with acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies the joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7x faster on a GTX 1080Ti GPU and 3.24x faster on a CPU than real-time. Compared with MelGAN, it also exhibits…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis · Model Reduction and Neural Networks
Methods: VocGAN · Dense Connections · Average Pooling · 1x1 Convolution · Residual Connection · Phase Shuffle · GAN Hinge Loss · Dilated Convolution · WGAN-GP Loss
