Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS

Tuomo Raitio; Jiangchuan Li; Shreyas Seshadri

arXiv:2110.02952 · eess.AS · March 24, 2022

TL;DR

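This paper introduces hierarchical prosody control for non-autoregressive parallel neural TTS: the model is conditioned on both coarse and fine-grained acoustic speech features, so the prosody of the synthesized speech can be varied and controlled while speech quality stays high.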

Contribution

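It proposes hierarchically conditioning the TTS front-end model on utterance-wise pitch, pitch range, duration, energy, and spectral tilt, which gives a latent prosody space with intuitive dimensions and improves controllability and diversity.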

Findings

01. Effective control of multiple prosodic dimensions

02. Generation of diverse speaking styles

03. Maintains or improves speech quality

Abstract

Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often represents the average prosodic style of the database instead of having more versatile prosodic variation. Moreover, many models lack the ability to control the output prosody, which does not allow for different styles for the same text input. In this work, we train a non-autoregressive parallel neural TTS front-end model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise…
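
To make "utterance-wise" conditioning features more concrete, the following minimal Python sketch summarizes frame-wise pitch and energy tracks into one value per utterance. Everything in it is an assumption for illustration: the function name `utterance_prosody_features`, the use of log-f0, the max-minus-min definition of pitch range, the per-phone duration measure, and the 5 ms frame shift are not taken from the paper. Spectral tilt is left out because it needs a spectral analysis step that the abstract does not describe.

```python
import math
from statistics import mean


def utterance_prosody_features(f0_hz, energy, num_phones, frame_shift_s=0.005):
    """Summarize frame-wise tracks into utterance-wise prosody features.

    Hypothetical definitions chosen for illustration only; the paper
    may compute these features differently.
    """
    if num_phones <= 0:
        raise ValueError("num_phones must be positive")
    if len(f0_hz) != len(energy):
        raise ValueError("f0 and energy tracks must have the same length")

    # Assumed convention: f0 == 0 marks an unvoiced frame, which is skipped.
    log_f0 = [math.log(f) for f in f0_hz if f > 0]
    if not log_f0:
        raise ValueError("utterance has no voiced frames")

    return {
        # Mean log-f0 over voiced frames.
        "pitch": mean(log_f0),
        # Spread of log-f0 over voiced frames (max minus min).
        "pitch_range": max(log_f0) - min(log_f0),
        # Average seconds per phone for the whole utterance.
        "duration": len(f0_hz) * frame_shift_s / num_phones,
        # Mean frame energy, including unvoiced frames.
        "energy": mean(energy),
    }


if __name__ == "__main__":
    feats = utterance_prosody_features(
        f0_hz=[0.0, 100.0, 200.0, 0.0],
        energy=[0.1, 0.3, 0.5, 0.1],
        num_phones=2,
        frame_shift_s=0.01,
    )
    for name, value in feats.items():
        print(f"{name}: {value:.4f}")
```

In a setup like the one the abstract describes, such per-utterance values would be fed to the model as conditioning inputs, and changing one of them at synthesis time would be one way to steer the corresponding prosodic dimension.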

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques