Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder

Xuyuan Li; Zengqiang Shang; Peiyang Shi; Hua Hua; Ta Li; Pengyuan Zhang

arXiv:2308.13365·cs.SD·September 26, 2024

Open Access

TL;DR

This paper introduces EP-MSTTS, a multi-step variational autoencoder-based system for highly expressive paragraph speech synthesis. It captures intra-paragraph features and variable styles, and it outperforms baseline models.

Contribution

It presents the first VITS-based paragraph speech synthesis model that models style at five hierarchical levels and is trained directly on paragraph-sliced speech.

Findings

01

EP-MSTTS outperforms baseline models on the single-speaker French audiobook corpus released at Blizzard Challenge 2023

02

Models style at five hierarchical levels: frame, phoneme, word, sentence, and paragraph

03

Trained directly on speech sliced by paragraph rather than by sentence

Abstract

Neural networks can now generate high-quality single-sentence speech. However, audiobook speech synthesis remains challenging because of the intra-paragraph correlation of semantic and acoustic features and because of variable styles. In this paper, we propose a highly expressive paragraph speech synthesis system with a multi-step variational autoencoder, called EP-MSTTS. EP-MSTTS is the first VITS-based paragraph speech synthesis model, and it models the variable style of paragraph speech at five levels: frame, phoneme, word, sentence, and paragraph. We also propose a series of improvements to enhance the performance of this hierarchical model. In addition, we train EP-MSTTS directly on speech sliced by paragraph rather than by sentence. Experimental results on the single-speaker French audiobook corpus released at Blizzard Challenge 2023 show EP-MSTTS obtains better…
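To make the five levels named in the abstract concrete, the following is a minimal sketch of how frame-level values nest into phoneme, word, sentence, and paragraph summaries. It is not the EP-MSTTS architecture: the paper uses a multi-step variational autoencoder, whereas this sketch swaps in plain mean pooling purely to illustrate the hierarchy. The function names (`pool`, `five_level_summary`) and the segmentation arguments are hypothetical and assume that alignments between levels are already available.

```python
from statistics import fmean


def pool(values, group_sizes):
    """Mean-pool consecutive runs of `values`; run lengths come from `group_sizes`."""
    if sum(group_sizes) != len(values):
        raise ValueError("group sizes must cover all values exactly")
    pooled, start = [], 0
    for size in group_sizes:
        pooled.append(fmean(values[start:start + size]))
        start += size
    return pooled


def five_level_summary(frames, frames_per_phoneme, phonemes_per_word,
                       words_per_sentence):
    """Build one summary per level by pooling the level directly below it.

    Assumption: mean pooling stands in for the learned latent variables
    of the actual model, which this sketch does not reproduce.
    """
    phoneme = pool(frames, frames_per_phoneme)
    word = pool(phoneme, phonemes_per_word)
    sentence = pool(word, words_per_sentence)
    paragraph = [fmean(sentence)]  # a single value for the whole paragraph
    return {
        "frame": list(frames),
        "phoneme": phoneme,
        "word": word,
        "sentence": sentence,
        "paragraph": paragraph,
    }


if __name__ == "__main__":
    # Toy paragraph: 6 frames, 4 phonemes, 2 words, 2 sentences.
    levels = five_level_summary(
        frames=[1.0, 3.0, 2.0, 2.0, 4.0, 6.0],
        frames_per_phoneme=[2, 2, 1, 1],
        phonemes_per_word=[2, 2],
        words_per_sentence=[1, 1],
    )
    for name, values in levels.items():
        print(name, values)
```

Each level is derived only from the one beneath it, so the coarser summaries (sentence, paragraph) depend on context that spans sentence boundaries. A model trained on paragraph-sliced speech can observe that context directly, whereas one trained on isolated sentences cannot.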

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques