Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

Chung-Ming Chien; Hung-yi Lee

arXiv:2011.06465·eess.AS·May 4, 2021

TL;DR

This paper introduces a hierarchical prosody modeling approach for non-autoregressive TTS that improves naturalness and quality of synthesized speech by conditioning phoneme-level prosody on word-level features.

Contribution

A novel hierarchical architecture for prosody modeling in non-autoregressive TTS that enhances naturalness and controllability of speech synthesis.

Findings

01. Outperforms competitors in audio quality

02. Achieves more natural prosody in synthesized speech

03. Validated through objective and subjective evaluations

Abstract

Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks. By explicitly providing prosody features to the TTS model, the style of synthesized utterances can be controlled. However, predicting natural and reasonable prosody at inference time is challenging. In this work, we analyzed the behavior of non-autoregressive TTS models under different prosody-modeling settings and proposed a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features. The proposed method outperforms other competitors in terms of audio quality and prosody naturalness in our objective and subjective evaluations.
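The hierarchical idea in the abstract, phoneme-level prosody prediction conditioned on word-level prosody, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how word-level features are computed or how the conditioning is wired. The sketch assumes word-level values are per-word averages of a phoneme-level feature such as pitch, and it uses a toy additive linear predictor in place of a neural network. All function names and parameters (`pool_to_words`, `expand_to_phonemes`, `predict_phoneme_prosody`, `w_hidden`, `w_word`) are hypothetical.

```python
from typing import List, Sequence


def pool_to_words(phoneme_values: Sequence[float],
                  phonemes_per_word: Sequence[int]) -> List[float]:
    """Average a phoneme-level prosody feature within each word (assumed pooling)."""
    if any(count <= 0 for count in phonemes_per_word):
        raise ValueError("every word needs at least one phoneme")
    if sum(phonemes_per_word) != len(phoneme_values):
        raise ValueError("phonemes_per_word must sum to the number of phonemes")
    pooled, start = [], 0
    for count in phonemes_per_word:
        segment = phoneme_values[start:start + count]
        pooled.append(sum(segment) / count)
        start += count
    return pooled


def expand_to_phonemes(word_values: Sequence[float],
                       phonemes_per_word: Sequence[int]) -> List[float]:
    """Repeat each word-level value once per phoneme in that word."""
    expanded: List[float] = []
    for value, count in zip(word_values, phonemes_per_word):
        expanded.extend([value] * count)
    return expanded


def predict_phoneme_prosody(phoneme_hidden: Sequence[float],
                            word_prosody: Sequence[float],
                            phonemes_per_word: Sequence[int],
                            w_hidden: float = 1.0,
                            w_word: float = 1.0) -> List[float]:
    """Toy stand-in for a phoneme-level predictor conditioned on word-level prosody.

    Each phoneme's output combines its own (scalar) hidden value with the
    prosody value of the word that contains it.
    """
    context = expand_to_phonemes(word_prosody, phonemes_per_word)
    if len(context) != len(phoneme_hidden):
        raise ValueError("phoneme_hidden length must match total phoneme count")
    return [w_hidden * h + w_word * c for h, c in zip(phoneme_hidden, context)]


# Two words with 2 and 3 phonemes respectively (made-up values).
phonemes_per_word = [2, 3]
phoneme_pitch = [1.0, 3.0, 2.0, 4.0, 6.0]

word_pitch = pool_to_words(phoneme_pitch, phonemes_per_word)
print(word_pitch)  # [2.0, 4.0]

phoneme_hidden = [0.5, -0.5, 0.0, 1.0, -1.0]
print(predict_phoneme_prosody(phoneme_hidden, word_pitch, phonemes_per_word))
# [2.5, 1.5, 4.0, 5.0, 3.0]
```

In a real model the word-level values would themselves be predicted at inference time, and the phoneme-level predictor would be a learned network over hidden vectors rather than a scalar sum. The sketch only shows the direction of information flow from the coarse level to the fine level.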

Code & Models

Repositories

ming024/FastSpeech2
PyTorch · Official
