Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

Geng Yang; Shan Yang; Kai Liu; Peng Fang; Wei Chen; Lei Xie

arXiv:2005.05106·cs.SD·November 18, 2020·21 cites

TL;DR

This paper introduces multi-band MelGAN, a faster waveform generation model for high-quality text-to-speech. It combines multi-band processing with a larger generator receptive field and a multi-resolution STFT loss to improve both quality and speed.

Contribution

The paper presents multi-band MelGAN, which adds an extended receptive field, a multi-resolution STFT loss, and multi-band processing to MelGAN, achieving high quality and efficiency in waveform synthesis.

Findings

01. Achieved MOS of 4.34 in waveform generation

02. Reduced computational complexity from 5.85 to 0.95 GFLOPS

03. Real-time factor of 0.03 on CPU
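
To put the reported figures in perspective, the short sketch below works through the arithmetic. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio), which this page does not spell out; the function names are hypothetical helpers introduced only for illustration.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def reduction_ratio(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Figures taken from the findings listed above.
reported_rtf = 0.03
speed_vs_real_time = 1.0 / reported_rtf        # how many times faster than playback
gflops_ratio = reduction_ratio(5.85, 0.95)     # complexity reduction factor

print(f"Speed relative to real time: {speed_vs_real_time:.1f}x")
print(f"Complexity reduction: {gflops_ratio:.2f}x")
```

Under that definition, a real-time factor below 1 means audio is produced faster than it plays back.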

Abstract

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN has achieved high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively.…
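
The abstract names the multi-resolution STFT loss without defining it. As a rough illustration only, the sketch below assumes a commonly used formulation: a spectral convergence term plus an L1 distance between log magnitudes, averaged over several STFT settings. The FFT sizes, hop sizes, window lengths, and all function names are illustrative assumptions, not values or code taken from the paper.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop, win_len):
    """Magnitude spectrogram from Hann-windowed frames (no padding at the edges)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=fft_size, axis=1)
    # Floor the magnitude so the logarithm below stays finite.
    return np.maximum(np.abs(spec), 1e-7)


def single_stft_loss(real, fake, fft_size, hop, win_len):
    """Assumed form: spectral convergence + mean absolute log-magnitude difference."""
    r = stft_magnitude(real, fft_size, hop, win_len)
    f = stft_magnitude(fake, fft_size, hop, win_len)
    spectral_convergence = np.linalg.norm(r - f) / np.linalg.norm(r)
    log_magnitude = np.mean(np.abs(np.log(r) - np.log(f)))
    return spectral_convergence + log_magnitude


# (fft_size, hop, win_len) triples; illustrative choices, not from the paper.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]


def multi_resolution_stft_loss(real, fake, resolutions=RESOLUTIONS):
    """Average the single-resolution loss over all STFT settings."""
    losses = [single_stft_loss(real, fake, *res) for res in resolutions]
    return float(np.mean(losses))


rng = np.random.default_rng(0)
real_wave = rng.standard_normal(4000)   # stand-in for real speech
fake_wave = rng.standard_normal(4000)   # stand-in for generator output
print(multi_resolution_stft_loss(real_wave, fake_wave))
```

Comparing spectra at several window lengths trades time resolution against frequency resolution, so a generator cannot satisfy the loss by matching only one of them. In a training setup this would be written with a differentiable framework; NumPy is used here only to show the computation.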

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing