VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders
Yubing Cao, Yongming Li, Liejun Wang, Yinfeng Yu

TL;DR
This paper introduces VNet, a GAN-based neural vocoder that takes full-band Mel spectrograms as input and pairs the generator with a multi-tier discriminator, aiming to produce high-fidelity, natural-sounding speech efficiently while addressing the over-smoothing associated with full-band input.
Contribution
The paper presents VNet, a novel GAN-based vocoder with a multi-tier discriminator and an asymptotic constraint method to improve speech naturalness and training stability.
Findings
VNet synthesizes high-fidelity speech faster than real time.
The multi-tier discriminator enhances high-resolution signal generation.
The asymptotic constraint improves training stability and speech naturalness.
Abstract
Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-limited spectral information, which inevitably sacrifices high-frequency details. To address this, we adopt the full-band Mel spectrogram information as input, aiming to provide the vocoder with the most comprehensive information possible. However, previous studies have revealed that the use of full-band spectral information as input can result in the issue of over-smoothing, compromising the naturalness of the synthesized speech. To tackle this challenge, we propose VNet, a GAN-based neural…
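To make the band-limited versus full-band distinction in the abstract concrete, the following is a minimal sketch of where Mel filter center frequencies fall under each setting. The sample rate (22050 Hz), the number of Mel bins (80), the 8 kHz cutoff, the HTK-style Mel formula, and the helper name `mel_center_frequencies` are all illustrative assumptions, not values or code taken from the paper.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style Mel scale; one common convention, assumed here for illustration.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_mels, fmin, fmax):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    # n_mels triangular filters need n_mels + 2 edge points;
    # the centers are the interior points.
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

SAMPLE_RATE = 22050   # assumed, not from the paper
N_MELS = 80           # assumed, not from the paper
BAND_LIMIT_HZ = 8000  # assumed cutoff for the band-limited case

band_limited = mel_center_frequencies(N_MELS, 0.0, BAND_LIMIT_HZ)
full_band = mel_center_frequencies(N_MELS, 0.0, SAMPLE_RATE / 2)

# Count the filters whose center lies above the band limit in each setting.
above_limited = sum(f > BAND_LIMIT_HZ for f in band_limited)
above_full = sum(f > BAND_LIMIT_HZ for f in full_band)
print("filters centered above cutoff (band-limited):", above_limited)
print("filters centered above cutoff (full-band):", above_full)
```

Under these assumptions, the band-limited filterbank has no filter centered above the cutoff, so the vocoder receives no direct description of that region, while the full-band filterbank spends some of its bins on it.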
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
