End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

Kentaro Mitsui; Tianyu Zhao; Kei Sawada; Yukiya Hono; Yoshihiko Nankaku; Keiichi Tokuda

arXiv:2206.12040 · eess.AS · June 27, 2022

TL;DR

This paper introduces an end-to-end TTS system that learns utterance-level latent representations of speaking style from spontaneous dialogue and predicts the style to synthesize from dialogue history, improving naturalness in dialogue-based speech synthesis.

Contribution

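The paper proposes a two-stage training framework. In the first stage, VAE-VITS or GMVAE-VITS is trained jointly with a style encoder that extracts a latent speaking style from speech. In the second stage, a style predictor is trained to predict the speaking style from dialogue history.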

Findings

01. Outperforms original VITS in dialogue-level naturalness

02. Effectively models speaking styles from spontaneous dialogue data

03. Enhances naturalness of TTS in conversational settings

Abstract

Recent text-to-speech (TTS) has achieved quality comparable to that of humans; however, its application in spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained; each introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with TTS. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from dialogue history. During inference,…
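
The utterance-level latent variable mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common diagonal-Gaussian VAE formulation with a standard-normal prior, which this page does not spell out; the GMVAE variant would presumably use a Gaussian-mixture prior instead. The function names, the 4-dimensional latent, and the example encoder outputs are all hypothetical and are not taken from the paper.

```python
import math
import random


def sample_style_latent(mean, log_var, rng):
    """Reparameterization trick: z = mean + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


# Hypothetical style-encoder output for one utterance (assumed 4-dim latent).
mean = [0.3, -0.1, 0.0, 0.8]
log_var = [-1.0, -0.5, 0.0, -2.0]

rng = random.Random(0)                       # seeded for repeatability
z = sample_style_latent(mean, log_var, rng)  # style vector that would condition the TTS
kl = kl_to_standard_normal(mean, log_var)    # regularizer that would be added to the TTS loss
```

Under this reading, the first stage would train the style encoder that produces `mean` and `log_var` together with the TTS model, while the second stage would train a separate predictor to estimate the style vector from dialogue history, so that no reference speech is needed to choose a style.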

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis

Methods: Variational Inference