VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

Yubing Cao; Yongming Li; Liejun Wang; Yinfeng Yu

arXiv:2408.06906·eess.AS·August 14, 2024

Open Access

TL;DR

This paper introduces VNet, a GAN-based neural vocoder with a multi-tier discriminator. It takes full-band Mel spectrograms as input and targets the over-smoothing that such input can cause, aiming for high-fidelity, natural-sounding speech at efficient synthesis speeds.

Contribution

The paper presents VNet, a novel GAN-based vocoder with a multi-tier discriminator and an asymptotic constraint method to improve speech naturalness and training stability.

Findings

01. VNet synthesizes high-fidelity speech faster than real time.

02. The multi-tier discriminator improves the generation of high-resolution signals.

03. The asymptotic constraint improves training stability and speech naturalness.

Abstract

Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-limited spectral information, which inevitably sacrifices high-frequency details. To address this, we adopt the full-band Mel spectrogram information as input, aiming to provide the vocoder with the most comprehensive information possible. However, previous studies have revealed that the use of full-band spectral information as input can result in the issue of over-smoothing, compromising the naturalness of the synthesized speech. To tackle this challenge, we propose VNet, a GAN-based neural…
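To make the band-limited versus full-band distinction in the abstract concrete, here is a minimal sketch that compares the frequency range covered by two Mel filterbank layouts. The 22,050 Hz sample rate, the 80 Mel bands, and the 8 kHz cutoff are illustrative assumptions common in vocoder setups; they are not taken from the paper, and this is not VNet's actual front end.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list:
    """Return n_mels + 2 edge frequencies (Hz), equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_mels + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_mels + 2)]


# All values below are assumptions chosen for illustration only.
SAMPLE_RATE = 22050          # Hz
NYQUIST = SAMPLE_RATE / 2    # highest frequency the waveform can represent
N_MELS = 80
BAND_LIMIT = 8000.0          # Hz, a typical band-limited cutoff

band_limited = mel_band_edges(N_MELS, 0.0, BAND_LIMIT)
full_band = mel_band_edges(N_MELS, 0.0, NYQUIST)

# Frequencies between the cutoff and Nyquist are invisible to a
# band-limited Mel input, so the vocoder has to guess them.
uncovered_hz = NYQUIST - band_limited[-1]

print(f"band-limited top edge: {band_limited[-1]:.1f} Hz")
print(f"full-band top edge:    {full_band[-1]:.1f} Hz")
print(f"range not covered by the band-limited input: {uncovered_hz:.1f} Hz")
```

Under these assumptions, the band-limited layout leaves the region between the cutoff and the Nyquist frequency out of the conditioning input, which is the high-frequency detail the abstract says is sacrificed.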

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing