Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

Ryuichi Yamamoto; Eunwoo Song; Jae-Min Kim

arXiv:1910.11480 · eess.AS · February 7, 2020 · 48 cites

TL;DR

Parallel WaveGAN introduces a fast, non-autoregressive speech waveform generation model that achieves high fidelity and speed without the need for distillation, making it suitable for real-time applications.

Contribution

The paper presents a distillation-free, GAN-based approach to waveform generation that is compact and fast while maintaining high speech quality.

Findings

01

Generates 24 kHz speech 28.68 times faster than real time on a single GPU.

02

Achieves a mean opinion score (MOS) of 4.16 within a TTS framework, comparable to distillation-based systems.

03

Uses only 1.44 million parameters, giving the model a small footprint.
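
To put the speed figure in concrete terms, the arithmetic can be sketched as below. This assumes the reported 28.68 is a plain ratio of audio duration to wall-clock synthesis time; the function and constant names are hypothetical and chosen only for illustration.

```python
# Figures taken from the findings above.
SAMPLE_RATE_HZ = 24_000   # 24 kHz output
SPEEDUP = 28.68           # times faster than real time


def synthesis_seconds(audio_seconds, speedup=SPEEDUP):
    """Wall-clock time needed to generate `audio_seconds` of speech."""
    return audio_seconds / speedup


def samples_per_second(sample_rate_hz=SAMPLE_RATE_HZ, speedup=SPEEDUP):
    """Waveform samples produced per wall-clock second."""
    return sample_rate_hz * speedup


if __name__ == "__main__":
    print(f"{synthesis_seconds(1.0) * 1000:.1f} ms per second of audio")
    print(f"{samples_per_second():,.0f} samples per wall-clock second")
```

Under that assumption, one second of 24 kHz speech takes roughly 35 ms to generate, or close to 690 thousand samples per wall-clock second.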

Abstract

We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained. Furthermore, our model is able to generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment. Perceptual listening test results verify that our proposed method achieves 4.16 mean opinion…
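
The multi-resolution spectrogram loss mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract as shown here does not spell out the loss terms, so the choices below are assumptions: each resolution combines a spectral convergence term and a log-magnitude L1 term, the adversarial term uses a least-squares form, and the STFT settings in `RESOLUTIONS` are illustrative rather than the authors' configuration. All function names are hypothetical.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop, win_length):
    """Magnitude spectrogram from Hann-windowed frames and a real FFT."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i * hop: i * hop + win_length] * window
                       for i in range(n_frames)])
    # rfft zero-pads each frame up to fft_size (win_length <= fft_size).
    mag = np.abs(np.fft.rfft(frames, n=fft_size, axis=1))
    return np.maximum(mag, 1e-7)  # floor so the log below stays finite


def stft_loss(generated, target, fft_size, hop, win_length):
    """Single-resolution loss: spectral convergence + log-magnitude L1."""
    mag_g = stft_magnitude(generated, fft_size, hop, win_length)
    mag_t = stft_magnitude(target, fft_size, hop, win_length)
    # Frobenius norm of the difference, normalised by the target's norm.
    spectral_convergence = np.linalg.norm(mag_t - mag_g) / np.linalg.norm(mag_t)
    log_magnitude = np.mean(np.abs(np.log(mag_t) - np.log(mag_g)))
    return spectral_convergence + log_magnitude


# Illustrative (fft_size, hop, win_length) triples; not the paper's settings.
RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (256, 64, 256)]


def multi_resolution_stft_loss(generated, target, resolutions=RESOLUTIONS):
    """Average the single-resolution loss over several STFT settings."""
    return float(np.mean([stft_loss(generated, target, *r) for r in resolutions]))


def generator_loss(generated, target, d_fake, lambda_adv):
    """Joint objective: spectrogram loss plus a weighted adversarial term.

    `d_fake` holds discriminator outputs on generated audio; the
    least-squares form (1 - D(G(z)))^2 is an assumption of this sketch.
    """
    adversarial = float(np.mean((1.0 - np.asarray(d_fake)) ** 2))
    return multi_resolution_stft_loss(generated, target) + lambda_adv * adversarial
```

Averaging over several window and FFT sizes is what lets the loss trade off time and frequency resolution: short windows penalise timing errors, long windows penalise errors in fine spectral detail. A real training setup would compute these terms in an autodiff framework so gradients can flow to the generator; NumPy is used here only to keep the arithmetic visible.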

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Mixture of Logistic Distributions · Tanh Activation · Dense Connections · Convolution · Dropout · WGAN-GP Loss · Phase Shuffle