Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech
Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie

TL;DR
This paper introduces multi-band MelGAN, a faster waveform generation model for high-quality text-to-speech. It combines multi-band processing with improved training techniques to raise both synthesis quality and speed.
Contribution
The paper presents multi-band MelGAN, which adds an extended receptive field, a multi-resolution STFT loss, and multi-band processing to the original MelGAN, achieving both high quality and high efficiency in waveform synthesis.
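The efficiency argument behind multi-band processing is that the generator emits several short streams instead of one long one. The sketch below illustrates only that sample-rate bookkeeping, using a plain polyphase (interleaving) split as a stand-in. It is not the filter bank a real vocoder would use, and the band count of 4 and the function names are hypothetical.

```python
def split_polyphase(signal, n_bands):
    """Split a signal into n_bands streams, each at 1/n_bands of the original rate.

    Assumption: simple interleaving, chosen only to show the length bookkeeping.
    """
    if len(signal) % n_bands != 0:
        raise ValueError("signal length must be a multiple of n_bands")
    return [signal[k::n_bands] for k in range(n_bands)]


def merge_polyphase(bands):
    """Interleave the streams back into one full-rate signal."""
    n_bands = len(bands)
    merged = [0.0] * (n_bands * len(bands[0]))
    for k, band in enumerate(bands):
        merged[k::n_bands] = band
    return merged


full_band = [float(i) for i in range(16)]   # toy "waveform"
bands = split_polyphase(full_band, 4)       # hypothetical band count
restored = merge_polyphase(bands)

print(len(full_band), [len(b) for b in bands])
print(restored == full_band)
```

Each stream is a quarter of the full length, so a generator that predicts the four streams in parallel needs a quarter as many output time steps as one that predicts the full-band waveform directly.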
Findings
Achieved MOS of 4.34 in waveform generation
Reduced computational complexity from 5.85 to 0.95 GFLOPS
Real-time factor of 0.03 on CPU
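The complexity and speed figures above can be put in relative terms with a little arithmetic. The sketch below assumes the usual definition of real-time factor (synthesis time divided by the duration of the audio produced), which this summary does not spell out. The function names are illustrative.

```python
def reduction_ratio(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


def relative_saving(before, after):
    """Fraction of the original cost that was removed."""
    return (before - after) / before


def real_time_factor(synthesis_seconds, audio_seconds):
    """Assumed definition: time spent synthesizing per second of audio."""
    return synthesis_seconds / audio_seconds


def speed_over_real_time(rtf):
    """Seconds of audio produced per second of compute."""
    return 1.0 / rtf


ratio = reduction_ratio(5.85, 0.95)    # GFLOPS figures from the findings
saving = relative_saving(5.85, 0.95)
speedup = speed_over_real_time(0.03)   # RTF figure from the findings

print(f"{ratio:.2f}x fewer GFLOPS ({saving:.1%} saved)")
print(f"RTF 0.03 is about {speedup:.0f}x faster than real time")
```

Under that definition, an RTF of 0.03 means roughly 0.3 seconds of CPU time for 10 seconds of speech.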
Abstract
In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively.…
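A multi-resolution STFT loss compares real and generated audio in the spectral domain at several analysis settings. The following is a minimal sketch of one common formulation (a spectral-convergence term plus a log-magnitude L1 term, averaged over resolutions). The abstract does not state the exact terms or settings, so the formulation, the toy resolutions, and all names here are assumptions. The naive DFT is for readability only and is far too slow for training.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, fft_size, hop, win_len):
    """Magnitude spectrogram via a naive DFT, floored to keep log() finite."""
    window = hann(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(win_len)]
        frame += [0.0] * (fft_size - win_len)  # zero-pad up to the FFT size
        bins = []
        for k in range(fft_size // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / fft_size)
                for n in range(fft_size)
            )
            bins.append(max(abs(acc), 1e-7))
        frames.append(bins)
    return frames


def stft_loss(real, fake, fft_size, hop, win_len):
    """Spectral convergence + mean log-magnitude L1 at one resolution."""
    s_real = stft_magnitude(real, fft_size, hop, win_len)
    s_fake = stft_magnitude(fake, fft_size, hop, win_len)
    diff_sq = real_sq = log_l1 = 0.0
    count = 0
    for row_real, row_fake in zip(s_real, s_fake):
        for a, b in zip(row_real, row_fake):
            diff_sq += (a - b) ** 2
            real_sq += a ** 2
            log_l1 += abs(math.log(a) - math.log(b))
            count += 1
    spectral_convergence = math.sqrt(diff_sq) / math.sqrt(real_sq)
    return spectral_convergence + log_l1 / count


def multi_resolution_stft_loss(real, fake, resolutions):
    """Average the single-resolution loss over (fft_size, hop, win_len) tuples."""
    return sum(stft_loss(real, fake, *r) for r in resolutions) / len(resolutions)


# Toy resolutions, chosen small so the naive DFT stays fast (hypothetical values).
RESOLUTIONS = [(32, 8, 16), (64, 16, 32), (128, 32, 64)]

real = [math.sin(2 * math.pi * 5 * t / 256) for t in range(256)]
fake = [math.sin(2 * math.pi * 7 * t / 256) for t in range(256)]

loss_same = multi_resolution_stft_loss(real, real, RESOLUTIONS)
loss_diff = multi_resolution_stft_loss(real, fake, RESOLUTIONS)
print(loss_same, loss_diff)
```

Using several window sizes lets short windows penalize timing errors while long windows penalize errors in fine frequency detail, which a single resolution cannot do at once.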
Code & Models
- 🤗 tensorspeech/tts-mb_melgan-baker-ch (model, ♡ 5)
- 🤗 tensorspeech/tts-mb_melgan-kss-ko (model, ♡ 2)
- 🤗 tensorspeech/tts-mb_melgan-ljspeech-en (model, ♡ 2)
- 🤗 tensorspeech/tts-mb_melgan-synpaflex-fr (model, ♡ 2)
- 🤗 tensorspeech/tts-mb_melgan-thorsten-ger (model, ♡ 1)
- 🤗 infinisoft/tts (model, ♡ 4)
- 🤗 Bilgilice/bilgilice35 (model)
- 🤗 praveenchordia/tts (model, ♡ 1)
- 🤗 bookbot/mb-melgan-hifi-postnets-sw-v1 (model, 4 dl, ♡ 1)
- 🤗 antoniomae1234/voice-xtts2 (model)
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
