Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

Kun Song; Jian Cong; Xinsheng Wang; Yongmao Zhang; Lei Xie; Ning Jiang; Haiying Wu

arXiv:2210.17349 · cs.SD · November 3, 2022

TL;DR

Robust MelGAN is a universal neural vocoder designed for high-fidelity TTS, employing novel dropout and data augmentation techniques to enhance robustness and generalization across diverse data sources.

Contribution

The paper introduces a robust version of MelGAN with specialized dropout and data augmentation methods to improve universal applicability and sound quality in TTS systems.

Findings

1. Significantly improves sound quality across various data types.
2. Reduces metallic sound issues in neural vocoders.
3. Maintains speaker similarity with enhanced robustness.

Abstract

In the current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder that, once trained, is robust to the imperfect mel-spectrograms predicted by the acoustic model. To this end, we propose the Robust MelGAN vocoder, which solves the original multi-band MelGAN's metallic sound problem and increases its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler that separates the speech signal into periodic and aperiodic components, we perform network dropout only on the aperiodic components, which alleviates the metallic sound and maintains good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment fake data in the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments…
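The idea of applying dropout to only one of two signal branches can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify how the over-smooth handler splits the signal, where in the generator dropout is placed, or what rate is used. The names `dropout` and `combine_branches`, the toy sine and noise inputs, and the rate of 0.5 are all assumptions made for illustration, and standard inverted dropout stands in for whatever fine-grained scheme the authors use.

```python
import math
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each element with probability `rate`
    and rescale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    scale = 1.0 / (1.0 - rate)
    return [v * scale if rng.random() >= rate else 0.0 for v in values]


def combine_branches(periodic, aperiodic, rate=0.5, training=True, rng=None):
    """Sum two branch outputs, applying dropout to the aperiodic branch only.

    `periodic` and `aperiodic` are hypothetical stand-ins for the two
    components mentioned in the abstract; the periodic branch is never dropped.
    """
    if training and rate > 0.0:
        rng = rng or random.Random()
        aperiodic = dropout(aperiodic, rate, rng)
    return [p + a for p, a in zip(periodic, aperiodic)]


rng = random.Random(0)
n = 200
# Toy inputs (assumed): a sine wave as the periodic part, Gaussian noise as the aperiodic part.
periodic = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
aperiodic = [0.1 * rng.gauss(0.0, 1.0) for _ in range(n)]

train_out = combine_branches(periodic, aperiodic, rate=0.5, training=True, rng=rng)
eval_out = combine_branches(periodic, aperiodic, training=False)
```

In training mode, each sample of the aperiodic branch is either zeroed or scaled by 2 while the periodic branch passes through untouched; with `training=False` the two branches are simply summed.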

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing