Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim; Jungil Kong; Juhee Son

arXiv:2106.06103·cs.SD·June 14, 2021·121 cites


TL;DR

This paper introduces an end-to-end text-to-speech model that combines variational inference, adversarial training, and stochastic duration prediction to produce more natural and diverse speech than existing systems, with a mean opinion score comparable to ground truth on the LJ Speech dataset.

Contribution

The paper presents a parallel end-to-end TTS approach that combines variational inference, normalizing flows, adversarial learning, and stochastic duration prediction to improve naturalness and diversity.
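For readers unfamiliar with the variational-inference component, the general form of a conditional VAE objective may help. The following is the standard evidence lower bound, not an equation quoted from this page; the symbols are assumptions chosen for illustration: x is the speech signal, c the text condition, z the latent variable, q_phi the approximate posterior, and p_theta the decoder and prior.

```latex
\log p_\theta(x \mid c) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\middle\|\,p_\theta(z \mid c)\right)
```

A normalizing flow f can make the prior more expressive through the usual change-of-variables rule, again stated here in its generic form:

```latex
p_\theta(z \mid c) \;=\;
  \mathcal{N}\!\left(f(z);\, \mu_\theta(c),\, \sigma_\theta(c)\right)
  \left|\det \frac{\partial f(z)}{\partial z}\right|
```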

Findings

1. Outperforms existing TTS systems in human evaluations.

2. Achieves MOS comparable to ground truth on LJ Speech dataset.

3. Generates diverse speech with different pitches and rhythms.
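The idea that sampled durations yield different rhythms for the same text can be illustrated with a toy sketch. This is a deliberately simplified Gaussian stand-in, not the paper's duration predictor: the function names, the per-token log-duration means and standard deviations, and the `noise_scale` parameter are all hypothetical values chosen for illustration.

```python
import numpy as np

def sample_durations(log_dur_mean, log_dur_std, rng, noise_scale=1.0):
    """Draw one integer duration (in frames) per input token.

    Hypothetical toy model: log-durations are Gaussian around a
    per-token mean. With noise_scale=0 the result is deterministic.
    """
    noise = rng.standard_normal(log_dur_mean.shape)
    log_dur = log_dur_mean + noise_scale * log_dur_std * noise
    # Round to the nearest frame count and give every token at least one frame.
    return np.maximum(np.rint(np.exp(log_dur)), 1).astype(int)

def expand_tokens(tokens, durations):
    """Repeat each token by its duration to get a frame-level sequence."""
    return np.repeat(tokens, durations)

rng = np.random.default_rng(0)
tokens = np.array([0, 1, 2, 3])                         # assumed token ids
log_dur_mean = np.log(np.array([3.0, 5.0, 2.0, 4.0]))   # assumed means
log_dur_std = np.full(4, 0.3)                           # assumed spread

# Two stochastic draws for the same text can give different rhythms.
durations_a = sample_durations(log_dur_mean, log_dur_std, rng)
durations_b = sample_durations(log_dur_mean, log_dur_std, rng)

# Turning the noise off recovers a single deterministic rhythm.
durations_det = sample_durations(log_dur_mean, log_dur_std, rng, noise_scale=0.0)

frames_a = expand_tokens(tokens, durations_a)
print(durations_a, durations_b, durations_det)
```

Setting `noise_scale` to zero collapses the sampler to one fixed alignment, which is the one-to-one behaviour a stochastic predictor is meant to avoid.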

Abstract

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or…


Code & Models


Videos

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (SlidesLive)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing