Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS
Tuomo Raitio, Jiangchuan Li, Shreyas Seshadri

TL;DR
This paper introduces a hierarchical prosody control method for non-autoregressive parallel neural TTS that produces versatile, controllable speech at high quality.
Contribution
It proposes hierarchically conditioning a non-autoregressive parallel TTS model on multiple prosodic features (utterance-wise pitch, pitch range, duration, energy, and spectral tilt) to improve controllability and prosodic diversity.
Findings
Effective control of multiple prosodic dimensions
Generation of diverse speaking styles
Maintained or improved speech quality
Abstract
Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often represents the average prosodic style of the database instead of having more versatile prosodic variation. Moreover, many models lack the ability to control the output prosody, which does not allow for different styles for the same text input. In this work, we train a non-autoregressive parallel neural TTS front-end model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise…
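To make the idea of utterance-wise conditioning concrete, here is a minimal Python sketch. It is not the authors' implementation: the feature definitions (mean over voiced frames, max-minus-min pitch range), the function names `utterance_prosody` and `condition`, and the concatenation scheme are all assumptions made for illustration, since this excerpt does not say how the features are extracted or fed to the model.

```python
from statistics import mean

# Order of the five utterance-wise dimensions named in the abstract.
FEATURES = ("pitch", "pitch_range", "duration", "energy", "spectral_tilt")


def utterance_prosody(f0_hz, energy, phone_durations_s, tilt):
    """Collapse frame- and phone-level tracks to one value per dimension.

    Hypothetical definitions: assumes at least one voiced frame, with
    0.0 marking unvoiced frames in the F0 track.
    """
    voiced = [f for f in f0_hz if f > 0]
    return {
        "pitch": mean(voiced),
        "pitch_range": max(voiced) - min(voiced),  # assumed definition
        "duration": mean(phone_durations_s),
        "energy": mean(energy),
        "spectral_tilt": mean(tilt),
    }


def condition(phone_encodings, prosody, bias=None):
    """Append the utterance-level prosody vector to every phone encoding.

    `bias` shifts selected dimensions, which is one simple way to expose
    a control knob at synthesis time (an assumption, not the paper's API).
    A real system would normalize the features first.
    """
    bias = bias or {}
    vec = [prosody[name] + bias.get(name, 0.0) for name in FEATURES]
    return [list(enc) + vec for enc in phone_encodings]


if __name__ == "__main__":
    feats = utterance_prosody(
        f0_hz=[0.0, 100.0, 110.0, 120.0, 0.0, 130.0],
        energy=[0.5, 0.7, 0.9, 0.7],
        phone_durations_s=[0.08, 0.12],
        tilt=[-6.0, -4.0],
    )
    # Raise the pitch target by 10 Hz while leaving other dimensions alone.
    rows = condition([[0.1, 0.2], [0.3, 0.4]], feats, bias={"pitch": 10.0})
    print(feats["pitch"], len(rows[0]))
```

Under these assumptions, changing a single entry of `bias` moves one prosodic dimension while the text-derived encodings stay untouched, which is the kind of per-dimension control the abstract describes.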
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
