Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Yuepeng Jiang; Tao Li; Fengyu Yang; Lei Xie; Meng Meng; Yujun Wang

arXiv:2406.05681·cs.SD·June 12, 2024

Open Access

TL;DR

This paper presents a zero-shot speech synthesis model that jointly models timbre and hierarchical prosody, significantly improving naturalness and expressiveness while maintaining speaker similarity.

Contribution

It introduces a novel hierarchical prosody modeling approach with a diffusion-based pitch predictor and a global timbre vector, advancing zero-shot speech synthesis capabilities.

Findings

01 Maintains timbre quality comparable to the baseline

02 Achieves better naturalness and expressiveness

03 Enhances prosody modeling with a hierarchical structure

Abstract

Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the…
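
Two ideas named in the abstract, a single global vector for speaker timbre and a diffusion model over pitch, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: mean pooling as the way to obtain the global vector, the linear noise schedule, the step count, and the function names `global_timbre_vector`, `alpha_bar`, and `noise_pitch`. The sketch covers only the standard DDPM-style forward (noising) step that a diffusion pitch predictor would be trained to invert; it does not reproduce the authors' architecture or their prosody adaptor.

```python
import math
import random


def global_timbre_vector(frames):
    """Collapse frame-level reference features into one utterance-level vector.

    Mean pooling is an assumption; the abstract only says a global vector is used.
    """
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) up to and including step t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def noise_pitch(pitch, t, betas, rng):
    """Sample a noised pitch contour x_t from q(x_t | x_0) (DDPM forward process)."""
    a_bar = alpha_bar(betas, t)
    noise = [rng.gauss(0.0, 1.0) for _ in pitch]
    noisy = [
        math.sqrt(a_bar) * p + math.sqrt(1.0 - a_bar) * n
        for p, n in zip(pitch, noise)
    ]
    return noisy, noise


# Hypothetical toy inputs: two 2-dimensional reference frames and a short,
# normalized pitch contour.
rng = random.Random(0)
timbre = global_timbre_vector([[1.0, 2.0], [3.0, 4.0]])
num_steps = 100
betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
pitch = [math.sin(0.1 * i) for i in range(50)]
noisy_pitch, target_noise = noise_pitch(pitch, 50, betas, rng)
```

In a full system of this kind, a denoising network would take `noisy_pitch`, the step index, the text-side features, and a conditioning vector such as `timbre`, and would be trained to predict `target_noise`. That network is omitted here because the page gives no details about it.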

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling