MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis
Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Xixin Wu, Shiyin Kang, Helen Meng

TL;DR
MSStyleTTS introduces a hierarchical multi-scale style modeling approach that leverages broader context for more natural and expressive speech synthesis, outperforming existing methods on audiobook datasets.
Contribution
The paper presents a multi-scale style modeling framework that incorporates hierarchical context information for expressive speech synthesis. Prior works predicted style embeddings at a single scale, using only information within the current sentence.
Findings
Significant improvement over baseline methods in naturalness and expressiveness.
Effective modeling of multi-scale style embeddings from broader context.
Analysis of hierarchical context and style representations enhances understanding of speech expressiveness.
Abstract
Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting style embeddings at a single scale from information within the current sentence. However, context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2-based acoustic model. The predictor is designed to explore the hierarchical context information by…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
