Cross-Utterance Conditioned VAE for Speech Generation

Yang Li; Cheng Yu; Guangzhi Sun; Weiqin Zu; Zheng Tian; Ying Wen; Wei Pan; Chao Zhang; Jun Wang; Yang Yang; Fanglei Sun

arXiv:2309.04156 · cs.SD · September 20, 2024

Open Access

TL;DR

This paper introduces CUC-VAE S2, a speech synthesis framework that conditions a VAE on cross-utterance information, drawing prosodic context from surrounding sentences to generate more natural, expressive, and editable speech.

Contribution

The paper proposes a cross-utterance conditioned VAE framework, together with algorithms for TTS and for speech editing built on it, to improve the naturalness and flexibility of synthesized speech.

Findings

01. Enhanced prosody and naturalness in synthesized speech.

02. Effective speech editing with realistic audio generation.

03. Significant improvements demonstrated on the LibriTTS dataset.

Abstract

Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for…
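The conditional-VAE idea in the abstract (a latent prosody variable whose distribution depends on context from surrounding sentences) can be illustrated with a minimal sketch. This is a generic conditional VAE in NumPy, not the authors' implementation: the context-conditioned prior, the single-layer networks, the dimension sizes, and all names (`prior_net`, `posterior_net`, `cvae_step`, `sample_prosody_latent`) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: context embedding, acoustic features, latent prosody code.
CONTEXT_DIM, ACOUSTIC_DIM, LATENT_DIM = 16, 8, 4


def linear(in_dim, out_dim):
    """Return a randomly initialised affine map as (weights, bias)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def gaussian_params(x, layer):
    """Project x to the mean and log-variance of a diagonal Gaussian."""
    weights, bias = layer
    out = x @ weights + bias
    return out[..., :LATENT_DIM], out[..., LATENT_DIM:]


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the latent axis."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0,
        axis=-1,
    )


# Assumed structure: the prior sees only the cross-utterance context, while
# the posterior also sees acoustic features of the current utterance.
prior_net = linear(CONTEXT_DIM, 2 * LATENT_DIM)
posterior_net = linear(CONTEXT_DIM + ACOUSTIC_DIM, 2 * LATENT_DIM)


def cvae_step(context, acoustic):
    """Training-time pass: sample z from the posterior and return the KL term."""
    mu_p, logvar_p = gaussian_params(context, prior_net)
    joint = np.concatenate([context, acoustic], axis=-1)
    mu_q, logvar_q = gaussian_params(joint, posterior_net)
    # Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu_q.shape)
    z = mu_q + np.exp(0.5 * logvar_q) * eps
    return z, kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)


def sample_prosody_latent(context):
    """Inference-time pass: no audio available, so sample from the prior."""
    mu_p, logvar_p = gaussian_params(context, prior_net)
    return mu_p + np.exp(0.5 * logvar_p) * rng.standard_normal(mu_p.shape)
```

In a full system the sampled latent would be passed to a decoder that predicts acoustic features, and the KL term would be added to a reconstruction loss; both are omitted here to keep the sketch short.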

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques