Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis
Devang S Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King

TL;DR
This paper introduces a speech synthesis model that explicitly controls prosodic features like pitch, energy, and duration, allowing for more natural and customizable speech generation with interpretable and precise control.
Contribution
The proposed model explicitly conditions on primary prosodic features, offering more interpretability and control than unsupervised latent feature models, and enables human-in-the-loop modifications for enhanced naturalness.
Findings
Generated speech is more natural than that of Tacotron 2 with a reference encoder.
Explicit prosodic control improves interpretability and temporal precision.
Human modifications further enhance speech naturalness.
Abstract
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: F0, energy and duration. The model is flexible about how the values of these features are specified: they can be externally provided, or predicted from text, or predicted then subsequently modified. Compared to a model that employs a variational auto-encoder to learn unsupervised…
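The three ways of specifying prosodic feature values described in the abstract (externally provided, predicted from text, or predicted and then modified) can be illustrated with a small sketch. This is not the authors' implementation: the `Prosody` container, the per-phone granularity, and the `scale_span` and `choose_prosody` helpers are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Prosody:
    """Hypothetical container: one value per phone (granularity is an assumption)."""
    f0: List[float]        # e.g. F0 per phone, in Hz
    energy: List[float]    # e.g. energy per phone, arbitrary units
    duration: List[float]  # e.g. duration per phone, in frames


def scale_span(values: List[float], span: Tuple[int, int], factor: float) -> List[float]:
    """Scale the values whose index lies in [start, end); leave the rest unchanged."""
    start, end = span
    return [v * factor if start <= i < end else v for i, v in enumerate(values)]


def choose_prosody(
    predicted: Prosody,
    provided: Optional[Prosody] = None,
    f0_edit: Optional[Tuple[Tuple[int, int], float]] = None,
) -> Prosody:
    """Pick the features the synthesiser would be conditioned on.

    1. Externally provided values take priority.
    2. Otherwise, predicted values are used, optionally with a local F0 edit.
    """
    if provided is not None:
        return provided
    if f0_edit is not None:
        span, factor = f0_edit
        return replace(predicted, f0=scale_span(predicted.f0, span, factor))
    return predicted


# Stand-in for values predicted from text (made-up numbers).
predicted = Prosody(
    f0=[100.0, 110.0, 120.0, 115.0],
    energy=[0.5, 0.6, 0.7, 0.6],
    duration=[5.0, 8.0, 6.0, 7.0],
)

# Raise F0 by 25% on phones 1 and 2 only; energy and duration stay as predicted.
edited = choose_prosody(predicted, f0_edit=((1, 3), 1.25))
print(edited.f0)
```

Restricting the edit to an index span is what gives the modification its temporal precision in this sketch: values outside the span are passed through untouched.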
Taxonomy
Methods: Highway Layer · Sigmoid Activation · Long Short-Term Memory · Highway Network · Dilated Causal Convolution · Max Pooling · Tanh Activation · Bidirectional GRU · Mixture of Logistic Distributions
