CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Vincent Wan; Chun-an Chan; Tom Kenter; Jakub Vit; Rob Clark

arXiv:1905.07195 · cs.CL · June 5, 2019 · 51 cites

TL;DR

This paper introduces a hierarchical variational autoencoder for speech synthesis that models prosodic variation at multiple linguistic levels, enabling more natural and diverse speech output from text-to-speech systems.

Contribution

It presents a novel dynamic hierarchical conditional variational autoencoder that captures prosody variation aligned with linguistic structure, improving naturalness and enabling prosody transfer.

Findings

1. Outperforms a non-hierarchical baseline in prosody modeling.
2. Enables prosody transfer across sentences.
3. Produces more natural and lively speech signals.

Abstract

The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder…
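
The inference-time idea in the abstract, sampling a sentence-level prosody embedding from the variational layer and decoding it into fundamental frequency, energy and duration, can be illustrated with a minimal sketch. It assumes a standard normal prior and the usual reparameterized draw; the names `sample_prosody_embedding` and `toy_decoder`, the embedding size, and all constants are hypothetical, and the decoder is a hand-written stand-in rather than the paper's network.

```python
import math
import random


def sample_prosody_embedding(dim, rng, mean=None, log_var=None):
    """Draw z = mean + exp(0.5 * log_var) * eps with eps ~ N(0, I).

    With mean and log_var left unset this samples from a standard
    normal prior (an assumption; the abstract does not state the prior).
    """
    mean = mean if mean is not None else [0.0] * dim
    log_var = log_var if log_var is not None else [0.0] * dim
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def toy_decoder(phones, z):
    """Hypothetical stand-in for a trained decoder.

    Returns one (f0_hz, energy, duration_ms) tuple per phone. The
    mapping is arbitrary and only shows that different embeddings
    give different prosodic features for the same text.
    """
    features = []
    for i, _phone in enumerate(phones):
        shift = z[i % len(z)]
        f0_hz = 120.0 * math.exp(0.1 * shift)
        energy = 1.0 / (1.0 + math.exp(-shift))
        duration_ms = 80.0 * math.exp(0.2 * shift)
        features.append((f0_hz, energy, duration_ms))
    return features


rng = random.Random(0)                 # fixed seed for repeatability
phones = ["HH", "AH", "L", "OW"]       # illustrative phone sequence

# Two independent draws give two prosodic renderings of the same text.
z_a = sample_prosody_embedding(8, rng)
z_b = sample_prosody_embedding(8, rng)
contour_a = toy_decoder(phones, z_a)
contour_b = toy_decoder(phones, z_b)
```

In a real system the decoder would also be conditioned on the linguistic features of each word, syllable and phone; here the phone list only fixes the output length.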

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet