TL;DR
This paper introduces a hierarchical prosody modeling approach for non-autoregressive TTS that improves naturalness and quality of synthesized speech by conditioning phoneme-level prosody on word-level features.
Contribution
A novel hierarchical architecture for prosody modeling in non-autoregressive TTS that enhances naturalness and controllability of speech synthesis.
Findings
Outperforms the compared methods in audio quality
Achieves more natural prosody in synthesized speech
Validated through objective and subjective evaluations
Abstract
Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks. Explicitly providing prosody features to the TTS model makes it possible to control the style of synthesized utterances. However, predicting natural and reasonable prosody at inference time is challenging. In this work, we analyzed the behavior of non-autoregressive TTS models under different prosody-modeling settings and proposed a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features. The proposed method outperforms other competitors in terms of audio quality and prosody naturalness in our objective and subjective evaluations.
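The conditioning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scalar "prosody" values, and the fixed linear weights are all hypothetical stand-ins for learned predictors. It only shows the data flow, in which word-level prosody is predicted first, expanded to phoneme resolution, and then fed to the phoneme-level predictor.

```python
from typing import List


def predict_word_prosody(word_hidden: List[float], weight: float = 0.5) -> List[float]:
    """Hypothetical word-level predictor: one prosody value per word.

    A real model would use a learned network; a fixed scale is used here
    purely for illustration.
    """
    return [weight * h for h in word_hidden]


def expand_word_to_phoneme(word_feats: List[float], phonemes_per_word: List[int]) -> List[float]:
    """Repeat each word-level feature once for every phoneme in that word."""
    expanded: List[float] = []
    for feat, count in zip(word_feats, phonemes_per_word):
        expanded.extend([feat] * count)
    return expanded


def predict_phoneme_prosody(phoneme_hidden: List[float], word_prosody: List[float]) -> List[float]:
    """Hypothetical phoneme-level predictor conditioned on word-level prosody.

    The conditioning is modeled as a simple sum of the phoneme hidden value
    and the word-level prosody value aligned to that phoneme (an assumption).
    """
    return [h + p for h, p in zip(phoneme_hidden, word_prosody)]


# Toy utterance: two words with 2 and 3 phonemes respectively (assumed values).
word_hidden = [2.0, 4.0]
phonemes_per_word = [2, 3]
phoneme_hidden = [0.1, 0.2, 0.3, 0.4, 0.5]

word_prosody = predict_word_prosody(word_hidden)
expanded = expand_word_to_phoneme(word_prosody, phonemes_per_word)
phoneme_prosody = predict_phoneme_prosody(phoneme_hidden, expanded)
```

In this sketch, every phoneme in a word sees the same word-level value, so the phoneme-level predictor only has to model the variation within each word.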
