Rich Prosody Diversity Modelling with Phone-level Mixture Density Network
Chenpeng Du, Kai Yu

TL;DR
This paper introduces a GMM-based mixture density network for phone-level prosody modeling, significantly enhancing the naturalness and diversity of synthetic speech compared to previous uni-modal approaches.
Contribution
It presents a novel GMM-MDN approach for phone-level prosody modeling, improving diversity and naturalness in speech synthesis.
Findings
GMM-MDN generates more natural prosody patterns.
The approach significantly improves prosody diversity.
Subjective evaluations favor GMM-MDN over single Gaussian models.
Abstract
Generating natural speech with diverse and smooth prosody patterns is a challenging task. Although random sampling from a phone-level prosody distribution has been investigated as a way to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what humans can achieve. This is largely due to the use of a uni-modal distribution, such as a single Gaussian, in prior work on phone-level prosody modelling. In this work, we propose a novel approach that models phone-level prosodies with a GMM-based mixture density network (GMM-MDN). Experiments on the LJSpeech dataset demonstrate that phone-level prosodies can precisely control the synthetic speech and that GMM-MDN can generate more natural and smooth prosody patterns than a single Gaussian. Subjective evaluations further show that the proposed approach not only achieves better naturalness, but also…
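The contrast between a single Gaussian and a Gaussian mixture can be made concrete with a small sketch. The code below is not the authors' implementation: it only illustrates, under stated assumptions, what a mixture density output for one phone could look like. The function names (`log_softmax`, `gmm_nll`, `sample_gmm`), the diagonal-covariance parameterisation, and the toy numbers are all hypothetical; the summary above does not give the network architecture, the prosody embedding size, or the number of mixture components.

```python
import math
import random


def log_softmax(logits):
    """Turn unnormalised mixture logits into log-weights that sum to 1."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def gmm_nll(target, logits, means, log_stds):
    """Negative log-likelihood of one prosody vector under a diagonal GMM.

    Assumed (hypothetical) shapes: target has D values, logits has K values,
    means and log_stds are K lists of D values each.
    """
    log_w = log_softmax(logits)
    terms = []
    for lw, mu, ls in zip(log_w, means, log_stds):
        log_comp = 0.0
        for t, m, s in zip(target, mu, ls):
            z = (t - m) / math.exp(s)
            # Log density of a 1-D Gaussian with std exp(s).
            log_comp += -0.5 * (z * z + math.log(2 * math.pi)) - s
        terms.append(lw + log_comp)
    # Log-sum-exp over components for numerical stability.
    top = max(terms)
    return -(top + math.log(sum(math.exp(a - top) for a in terms)))


def sample_gmm(logits, means, log_stds, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    weights = [math.exp(lw) for lw in log_softmax(logits)]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, math.exp(s)) for m, s in zip(means[k], log_stds[k])]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy two-component mixture over a 2-D prosody vector (made-up values).
    logits = [0.0, 0.0]
    means = [[-2.0, -2.0], [2.0, 2.0]]
    log_stds = [[-1.0, -1.0], [-1.0, -1.0]]
    print("samples:", [sample_gmm(logits, means, log_stds, rng) for _ in range(3)])
    print("NLL at a mode:", gmm_nll([2.0, 2.0], logits, means, log_stds))
```

With one component (K = 1) this reduces to the single-Gaussian case, whose samples cluster around one mean. With several components, repeated sampling can land in different modes, which is the mechanism a mixture offers for more varied prosody.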
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
