Uncovering Latent Style Factors for Expressive Speech Synthesis
Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous

TL;DR
This paper introduces style tokens in Tacotron to automatically learn and control diverse prosodic styles in speech synthesis without explicit annotations, enabling more expressive and consistent synthetic speech.
Contribution
It proposes a novel data-driven method using style tokens to extract and manipulate prosodic styles in end-to-end speech synthesis models.
Findings
Style tokens capture independent prosodic variations
Prosodic styles can be controlled in a somewhat predictable, globally consistent way
Approach works without annotated data
Abstract
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.
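To make the idea of a text-independent style factor concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how tokens are combined or injected into Tacotron, so the softmax-weighted sum, the additive conditioning, and all names (`make_token_bank`, `style_embedding`, `condition_text`) are assumptions chosen for illustration. In a real model the token bank would be learned jointly with the synthesizer rather than sampled at random.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_token_bank(num_tokens, dim, seed=0):
    """Hypothetical bank of style token embeddings.

    Randomly initialised here; assumed to be learned in a real model.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(num_tokens)]


def style_embedding(token_bank, scores):
    """Mix the tokens into one style vector (assumed: softmax-weighted sum)."""
    weights = softmax(scores)
    dim = len(token_bank[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_bank)) for d in range(dim)]


def condition_text(text_states, style):
    """Add the same style vector to every text encoder state (assumed scheme)."""
    return [[x + s for x, s in zip(state, style)] for state in text_states]


# Selecting token 2 with a dominant score yields (almost exactly) token 2's vector.
bank = make_token_bank(num_tokens=4, dim=3)
style = style_embedding(bank, [0.0, 0.0, 50.0, 0.0])

# The style vector is computed without looking at the text, so a short and a
# long input receive the identical style offset at every position.
short_input = condition_text([[0.0, 0.0, 0.0]] * 2, style)
long_input = condition_text([[0.0, 0.0, 0.0]] * 5, style)
```

Because the style vector never depends on the text states, picking the same token for different sentences applies the same offset everywhere, which is one simple way to read the abstract's claim of a fixed, globally consistent style factor.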
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network
