Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Jaehyeon Kim, Jungil Kong, Juhee Son

TL;DR
This paper introduces an end-to-end text-to-speech model that combines variational inference, adversarial training, and stochastic duration prediction to produce more natural and diverse speech than existing systems, with a mean opinion score (MOS) comparable to ground-truth recordings on the LJ Speech dataset.
Contribution
It presents a new parallel TTS approach combining variational inference, normalizing flows, adversarial learning, and stochastic duration prediction for improved naturalness and diversity.
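To make the idea of stochastic duration prediction concrete, here is a toy sketch of how sampling durations from noise lets the same text come out with different rhythms. It is not the paper's predictor: the log-normal noise model, the function names `sample_durations` and `expand`, and the `noise_scale` parameter are all assumptions made for illustration.

```python
import math
import random


def sample_durations(mean_log_durations, noise_scale=0.8, seed=None):
    """Draw one integer frame count per phoneme.

    Toy assumption: log-duration = predicted mean + Gaussian noise.
    The model summarized above uses a learned predictor instead.
    """
    rng = random.Random(seed)
    durations = []
    for mu in mean_log_durations:
        log_d = mu + noise_scale * rng.gauss(0.0, 1.0)
        # Every phoneme must occupy at least one frame.
        durations.append(max(1, round(math.exp(log_d))))
    return durations


def expand(phonemes, durations):
    """Repeat each phoneme for its sampled number of frames."""
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        frames.extend([phoneme] * duration)
    return frames


# Hypothetical input: three phonemes with made-up mean log-durations.
phonemes = ["HH", "AY", "."]
means = [math.log(3), math.log(6), math.log(2)]

fast_or_slow_a = expand(phonemes, sample_durations(means, seed=1))
fast_or_slow_b = expand(phonemes, sample_durations(means, seed=2))
```

Each call with a different seed yields its own frame-level alignment for the same phoneme sequence, which is the one-to-many behavior the summary refers to. Setting `noise_scale=0` collapses the sketch to a deterministic duration predictor.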
Findings
Outperforms existing TTS systems in human evaluations.
Achieves a mean opinion score (MOS) comparable to ground truth on the LJ Speech dataset.
Generates diverse speech with different pitches and rhythms.
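Since the findings above are stated in terms of MOS, a minimal sketch of how a mean opinion score and its confidence interval are commonly computed may help. The ratings below are made up, the function name `mos_with_ci` is hypothetical, and the normal-approximation interval is an assumption; this page does not describe the paper's exact evaluation protocol.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval).

    Assumes ratings are independent scores on a 1-5 naturalness scale
    and uses a normal approximation for the interval.
    """
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        # A single rating gives no estimate of spread.
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener ratings for one system.
ratings = [4, 5, 4, 5, 4, 3, 5, 4]
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} ± {half_width:.2f}")
```

Two systems are usually described as comparable when their intervals overlap substantially, which is the sense in which a synthesized MOS can be close to the ground-truth score.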
Abstract
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or…
Code & Models
- 🤗 projecte-aina/tts-ca-coqui-vits-multispeaker (model) · ♡ 4
- 🤗 ulysses115/pmvoice (model) · 2 dl
- 🤗 infinisoft/tts (model) · ♡ 4
- 🤗 proxectonos/Nos_TTS-sabela-vits-phonemes (model) · ♡ 2
- 🤗 proxectonos/Nos_TTS-celtia-vits-graphemes (model) · ♡ 1
- 🤗 xihan123/so-vits-svc-5.0-nine (model) · ♡ 5
- 🤗 Bilgilice/bilgilice35 (model)
- 🤗 lojban/text-to-speech (model)
- 🤗 bangla-speech-processing/bangla_tts_female (model) · 22 dl · ♡ 5
- 🤗 bangla-speech-processing/bangla_tts_male (model) · 14 dl · ♡ 1
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
