Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

Yang Li; Cheng Yu; Guangzhi Sun; Hua Jiang; Fanglei Sun; Weiqin Zu; Ying Wen; Yang Yang; Jun Wang

arXiv:2205.04120·cs.SD·May 10, 2022

TL;DR

This paper introduces a cross-utterance conditioned VAE for TTS that models prosody by leveraging context from surrounding sentences, resulting in more natural and expressive speech synthesis.

Contribution

The paper proposes CUC-VAE, a cross-utterance conditional VAE that conditions on text from past and future sentences to generate context-aware prosody features in TTS systems.

Findings

01. Improves naturalness and prosody diversity in synthesized speech.

02. Enhances prosody modeling by conditioning on cross-utterance context.

03. Achieves better performance on LJ-Speech and LibriTTS datasets.

Abstract

Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme by conditioning on acoustic features, speaker information, and text features obtained from both past and future sentences. At inference time, instead of the standard Gaussian distribution used by VAE, CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information, which allows the prosody features generated by the TTS system to be related to the context and is more similar to how humans naturally produce prosody. The performance of CUC-VAE is evaluated via a qualitative listening test for naturalness, intelligibility and quantitative…
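
The difference between a standard VAE prior and a context-conditioned prior, as described in the abstract, can be illustrated with a small sketch. This is not the authors' implementation: `toy_prior`, the two-dimensional latent, and all numeric values are hypothetical stand-ins for learned networks and real embeddings. It only shows the general conditional-VAE pattern: during training the KL term compares the posterior with a prior computed from context, and at inference the latent is sampled from that context-dependent prior rather than from N(0, I).

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_latent(mu, logvar, rng):
    """Reparameterised sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def toy_prior(context_embedding):
    """Hypothetical stand-in for a learned prior network.

    Maps a cross-utterance context vector to the (mu, logvar) of a
    2-dimensional latent prosody variable. A real system would use a
    trained network here; this fixed mapping is only for illustration.
    """
    mean_ctx = sum(context_embedding) / len(context_embedding)
    mu = [0.5 * mean_ctx, -0.5 * mean_ctx]
    logvar = [-1.0, -1.0]
    return mu, logvar


rng = random.Random(0)
context = [0.2, 0.4, 0.6]  # made-up embedding of neighbouring sentences
mu_p, logvar_p = toy_prior(context)

# Training: posterior parameters would come from an encoder that also sees
# acoustic features and speaker information (assumed values here).
mu_q, logvar_q = [0.3, -0.1], [-1.5, -1.2]
kl_conditional = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
kl_standard = kl_diag_gaussians(mu_q, logvar_q, [0.0, 0.0], [0.0, 0.0])

# Inference: no acoustic features are available, so the latent is drawn
# from the context-conditioned prior instead of N(0, I).
z = sample_latent(mu_p, logvar_p, rng)
```

Changing `context` shifts the prior mean, so the sampled latent depends on the surrounding sentences. That is the property the abstract attributes to the utterance-specific prior.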

Code & Models

Repositories

neurowave-ai/cucvae-tts
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems