MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Shun Lei; Yixuan Zhou; Liyang Chen; Zhiyong Wu; Xixin Wu; Shiyin Kang; Helen Meng

arXiv:2307.16012·cs.SD·August 1, 2023·1 cites

Open Access

TL;DR

MSStyleTTS introduces a hierarchical multi-scale style modeling approach that leverages broader context for more natural and expressive speech synthesis, outperforming existing methods on audiobook datasets.

Contribution

The paper presents a multi-scale style modeling framework that incorporates hierarchical context information for expressive speech synthesis. Prior works, which predict style embeddings at a single scale from the current sentence alone, did not address this.

Findings

01. Significant improvement over baseline methods in naturalness and expressiveness.

02. Effective modeling of multi-scale style embeddings from broader context.

03. Analysis of hierarchical context and style representations enhances understanding of speech expressiveness.

Abstract

Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting style embeddings at a single scale from the information within the current sentence. However, the context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis that captures and predicts styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore the hierarchical context information by…
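
As a rough illustration of what "styles at different levels" could mean in practice, the Python sketch below pools frame-level feature vectors at three granularities. It is not the paper's extractor: the mean-pooling operation, the three granularities (whole context, sentence, sub-sentence unit), and the names `mean_pool` and `multi_scale_styles` are all assumptions made for illustration, since the abstract does not specify them.

```python
from statistics import fmean
from typing import List, Sequence, Tuple

Vector = List[float]
Unit = List[Vector]        # frames belonging to one sub-sentence unit (assumed granularity)
Sentence = List[Unit]      # units belonging to one sentence
Context = List[Sentence]   # neighboring sentences considered together


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors, dimension by dimension."""
    if not frames:
        raise ValueError("cannot pool an empty span")
    return [fmean(dim) for dim in zip(*frames)]


def multi_scale_styles(
    context: Context,
) -> Tuple[Vector, List[Vector], List[List[Vector]]]:
    """Toy stand-in for a multi-scale style extractor (hypothetical, not the paper's model).

    Returns one vector for the whole context, one per sentence,
    and one per unit inside each sentence.
    """
    unit_styles = [[mean_pool(unit) for unit in sentence] for sentence in context]
    sentence_styles = [
        mean_pool([frame for unit in sentence for frame in unit])
        for sentence in context
    ]
    global_style = mean_pool(
        [frame for sentence in context for unit in sentence for frame in unit]
    )
    return global_style, sentence_styles, unit_styles


# Two sentences; the first has two units, the second has one.
example_context: Context = [
    [[[1.0, 0.0], [3.0, 0.0]], [[5.0, 3.0]]],
    [[[7.0, 1.0]]],
]
global_style, sentence_styles, unit_styles = multi_scale_styles(example_context)
```

In a trained system the averaging would presumably be replaced by learned encoders, and a separate predictor would estimate these vectors from text at inference time. The sketch only shows how a single stretch of speech can yield style representations at several scales at once.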

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques