CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
Vincent Wan, Chun-an Chan, Tom Kenter, Jakub Vit, Rob Clark

TL;DR
This paper introduces a hierarchical variational autoencoder for speech synthesis that models prosodic variation at multiple linguistic levels, enabling more natural and diverse speech output from text-to-speech systems.
Contribution
It presents a novel dynamic hierarchical conditional variational autoencoder that captures prosody variation aligned with linguistic structure, improving naturalness and enabling prosody transfer.
Findings
Outperforms a non-hierarchical baseline in prosody modeling.
Enables prosody transfer across sentences.
Produces more natural and lively speech signals.
Abstract
The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder…
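The inference-time sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the sentence-level prosody embedding is drawn from a standard normal prior, which is the usual choice for a variational autoencoder but is not stated in the text above. The function name `sample_prosody_embedding`, the `scale` parameter, and the embedding size are hypothetical.

```python
import random


def sample_prosody_embedding(dim, scale=1.0, seed=None):
    """Draw one sentence-level prosody embedding.

    Assumption: the prior is a standard normal N(0, I). `scale` multiplies
    the standard deviation, so scale=0.0 returns the prior mean (all zeros),
    which stands in for "averaged" prosody in this sketch.
    """
    rng = random.Random(seed)
    return [scale * rng.gauss(0.0, 1.0) for _ in range(dim)]


# Two draws for the same text give two different embeddings; a trained
# decoder (not shown here) would turn each one, together with the
# linguistic features, into a different F0, energy and duration contour.
embedding_a = sample_prosody_embedding(dim=8, seed=1)
embedding_b = sample_prosody_embedding(dim=8, seed=2)
```

Fixing the seed makes a particular rendition reproducible, while changing it yields a different prosodic reading of the same sentence.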
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet
