Cross-Utterance Conditioned VAE for Speech Generation
Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun

TL;DR
This paper introduces CUC-VAE S2, a speech synthesis framework that conditions a VAE on cross-utterance information (prosodic context from surrounding sentences) to generate more natural, expressive, and editable speech.
Contribution
The paper proposes a new cross-utterance conditioned VAE framework with algorithms for TTS and speech editing, improving naturalness and flexibility in speech synthesis.
Findings
Enhanced prosody and naturalness in synthesized speech.
Effective speech editing with realistic audio generation.
Significant improvements demonstrated on the LibriTTS dataset.
Abstract
Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for…
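The conditioning idea in the abstract can be pictured with a minimal sketch of the two building blocks any conditional VAE relies on: a KL term that pulls the posterior over a latent prosody variable toward a context-dependent prior, and reparameterized sampling from that prior at synthesis time. This is a generic illustration, not the authors' architecture. The function names, the diagonal-Gaussian form, and the example numbers are all assumptions, and in a real system the prior parameters would come from a network fed with the cross-utterance features.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians.

    Each argument is a list of per-dimension means or log-variances.
    Hypothetical helper: q plays the role of the posterior, p the role
    of a prior conditioned on surrounding-utterance context.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        var_ratio_term = (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
        total += 0.5 * (lp - lq + var_ratio_term - 1.0)
    return total


def sample_latent(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed values standing in for network outputs.
    prior_mu, prior_logvar = [0.2, -0.1], [0.0, 0.0]   # from context features
    post_mu, post_logvar = [0.5, 0.0], [-1.0, -1.0]    # from target utterance
    print("KL:", kl_diag_gaussians(post_mu, post_logvar, prior_mu, prior_logvar))
    print("z :", sample_latent(prior_mu, prior_logvar, rng))
```

During training the KL value would be added to a reconstruction loss. At inference only the context-conditioned prior is needed, which is why the latent can vary with the surrounding sentences.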
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
