Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

Yi Lei; Shan Yang; Jian Cong; Lei Xie; Dan Su

arXiv:2207.01832·cs.SD·July 6, 2022

TL;DR

Glow-WaveGAN 2 pairs a universal WaveGAN with a flow-based acoustic model for high-quality zero-shot text-to-speech and any-to-any voice conversion, handling unseen speakers without fine-tuning by modeling a shared latent distribution of speech.

Contribution

It extends the earlier Glow-WaveGAN to address zero-shot synthesis at both the acoustic-modeling and vocoder stages, using a universal WaveGAN and a flow-based acoustic model.

Findings

01

Achieves high-quality zero-shot TTS and VC without fine-tuning.

02

Demonstrates superior performance on the LibriTTS and VCTK datasets.

03

Effectively models a continuous speaker space for new speaker generation.

Abstract

The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting to new voices in the zero-shot scenario exist in both stages (acoustic modeling and vocoder), previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem from both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model for extracting the latent distribution $p(z)$ of speech and reconstructing the waveform from it. Then a flow-based acoustic model only needs to learn the same $p(z)$ from texts, which naturally avoids the mismatch between the acoustic model and the vocoder, resulting in high-quality generated speech without model fine-tuning. Based on a continuous…
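As a rough illustration of why an invertible flow over the speech latent $z$ lends itself to any-to-any conversion, the toy sketch below uses a single speaker-conditioned affine layer. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the way the speaker embedding is turned into a scale and shift, and the three-dimensional latents are all hypothetical. A real model would stack many learned flow layers and obtain $z$ from a WaveGAN-style encoder.

```python
import math

# Toy sketch only: one speaker-conditioned affine flow layer over a latent z.
# The conditioning rule below is a hypothetical stand-in for learned networks.


def speaker_affine_params(speaker_emb):
    """Derive a per-dimension log-scale and shift from a speaker embedding."""
    log_scale = [0.5 * math.tanh(e) for e in speaker_emb]
    shift = [0.1 * e for e in speaker_emb]
    return log_scale, shift


def flow_forward(z, speaker_emb):
    """Map a speaker-dependent latent z to a variable u with the speaker's scale and shift removed."""
    log_scale, shift = speaker_affine_params(speaker_emb)
    return [(zi - b) * math.exp(-s) for zi, s, b in zip(z, log_scale, shift)]


def flow_inverse(u, speaker_emb):
    """Exact inverse of flow_forward for the same speaker embedding."""
    log_scale, shift = speaker_affine_params(speaker_emb)
    return [ui * math.exp(s) + b for ui, s, b in zip(u, log_scale, shift)]


def convert_voice(z_src, src_emb, tgt_emb):
    """Remove the source speaker's conditioning, then apply the target speaker's."""
    u = flow_forward(z_src, src_emb)
    return flow_inverse(u, tgt_emb)


if __name__ == "__main__":
    z_src = [0.3, -1.2, 0.8]       # hypothetical latent of a source utterance
    src_emb = [0.5, -0.4, 1.0]     # hypothetical source speaker embedding
    tgt_emb = [-1.0, 0.2, 0.3]     # hypothetical target speaker embedding
    print(convert_voice(z_src, src_emb, tgt_emb))
```

Because the layer is exactly invertible, converting to the target speaker and back to the source recovers the original latent, which is the property the sketch is meant to show. A vocoder-side decoder would then turn the converted latent into a waveform.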

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing