Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

Lauri Juvela; Bajibabu Bollepalli; Junichi Yamagishi; Paavo Alku

arXiv:1810.12598 · eess.AS · October 31, 2018

TL;DR

This paper explores using GANs for waveform generation in text-to-speech synthesis, focusing on efficiency and quality, and finds that GAN-based glottal excitation models can match WaveNet vocoders in quality.

Contribution

It introduces a GAN-based approach to waveform generation in TTS and shows that, when the GAN models the glottal excitation rather than the speech waveform directly, quality and voice similarity are on par with a WaveNet vocoder.

Findings

01

GAN-based glottal excitation model achieves comparable quality to WaveNet vocoder.

02

Direct waveform generation with GANs is still far behind WaveNet in quality.

03

GANs support parallel inference, avoiding the slow sequential inference of WaveNet-style models.

Abstract

The state-of-the-art in text-to-speech synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more expensive computationally. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent advances in GAN training techniques, this investigation studies waveform generation for TTS in two domains (speech signal and glottal excitation). Listening test results show that while direct waveform generation with GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet…

Taxonomy

Methods: Mixture of Logistic Distributions · Convolution · Dilated Causal Convolution · WaveNet