End-to-End Text-to-Speech using Latent Duration based on VQ-VAE

Yusuke Yasuda; Xin Wang; Junichi Yamagishi

arXiv:2010.09602·eess.AS·October 21, 2020


Open Access

TL;DR

This paper introduces a novel end-to-end text-to-speech framework that models explicit duration as a discrete latent variable using VQ-VAE, enabling joint optimization and improved alignment in TTS systems.

Contribution

The paper proposes a TTS approach that incorporates duration as a discrete latent variable via a conditional VQ-VAE, provides a theoretical justification for the formulation, and allows all modules to be jointly optimized from scratch.

Findings

01

Naturalness ratings in the listening test fell between those of soft-attention-based methods and explicit-duration-based methods.

02

Demonstrated effective explicit duration modeling with a variational autoencoder.

03

Validated the approach through listening tests comparing with existing TTS methods.

Abstract

Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework using explicit duration modeling that incorporates duration as a discrete latent variable to TTS and enables joint optimization of whole modules from scratch. We formulate our method based on conditional VQ-VAE to handle discrete duration in a variational autoencoder and provide a theoretical explanation to justify our method. In our framework, a connectionist temporal classification (CTC)-based force aligner acts as the approximate posterior, and text-to-duration works as the prior in the variational autoencoder. We evaluated our proposed method with a listening test and compared it with other TTS methods based on soft-attention or explicit duration modeling. The results showed that our systems rated between soft-attention-based methods…
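To make the idea of a discrete latent concrete, here is a minimal sketch of the nearest-codebook lookup at the core of any VQ-VAE. It is not the authors' implementation: the function name `quantize`, the scalar per-token latents, and the codebook of durations in frames are all assumptions chosen for illustration, and the paper's actual aligner, prior, and training objective are not shown.

```python
from typing import List, Sequence


def quantize(latents: Sequence[float], codebook: Sequence[float]) -> List[int]:
    """Return, for each latent, the index of the nearest codebook entry."""
    indices = []
    for z in latents:
        # Nearest-neighbour lookup by squared distance: the generic
        # VQ-VAE quantization step (not specific to the paper).
        best = min(range(len(codebook)), key=lambda k: (z - codebook[k]) ** 2)
        indices.append(best)
    return indices


# Hypothetical codebook of allowed durations (in frames) and hypothetical
# continuous per-token outputs from some encoder.
codebook = [1.0, 2.0, 4.0, 8.0]
latents = [1.2, 3.4, 6.5, 0.3]

indices = quantize(latents, codebook)
durations = [codebook[k] for k in indices]
print(indices)    # [0, 2, 3, 0]
print(durations)  # [1.0, 4.0, 8.0, 1.0]
```

Each token ends up with one discrete duration drawn from a fixed set, which is the property that lets duration be treated as a discrete latent variable rather than a continuous one.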


Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: VQ-VAE