Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Yuxuan Wang; Daisy Stanton; Yu Zhang; RJ Skerry-Ryan; Eric Battenberg; Joel Shor; Ying Xiao; Fei Ren; Ye Jia; Rif A. Saurous

arXiv:1803.09017·cs.CL·March 28, 2018·474 cites

TL;DR

This paper introduces global style tokens (GSTs), a novel unsupervised embedding approach within Tacotron that models, controls, and transfers speech style and expressiveness without explicit labels.

Contribution

The work presents GSTs as a new method for unsupervised style modeling in end-to-end speech synthesis, enabling style control and transfer.

Findings

01 GSTs effectively model a wide range of acoustic styles
02 They enable independent control of speaking style and speed
03 GSTs can transfer styles from single clips to large text corpora

Abstract

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
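
The "bank of embeddings" and the soft "labels" described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single-head scaled dot-product attention, tanh-squashed tokens, and made-up sizes, and the names `style_embedding`, `w_query`, and `reference` are hypothetical. It shows how a summary vector of a reference clip could be turned into weights over the tokens, and how those weights could instead be set by hand to control style without any reference audio.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def style_embedding(reference, tokens, w_query):
    """Combine a bank of style tokens into one style embedding.

    reference: (ref_dim,) summary vector of a reference clip (assumed given)
    tokens:    (num_tokens, token_dim) bank of style tokens
    w_query:   (ref_dim, token_dim) projection of the reference into token space
    """
    query = reference @ w_query                          # (token_dim,)
    scores = tokens @ query / np.sqrt(tokens.shape[1])   # (num_tokens,)
    weights = softmax(scores)                            # soft "labels", sum to 1
    return weights @ tokens, weights                     # weighted sum of tokens

# Illustrative sizes and random values; in a trained system these are learned.
rng = np.random.default_rng(0)
num_tokens, token_dim, ref_dim = 10, 16, 8
tokens = np.tanh(rng.normal(size=(num_tokens, token_dim)))  # tanh is an assumption
w_query = rng.normal(size=(ref_dim, token_dim))
reference = rng.normal(size=ref_dim)

# Style transfer: weights are inferred from the reference clip.
style, weights = style_embedding(reference, tokens, w_query)

# Style control: pick the weights directly, here selecting only token 3.
manual_weights = np.zeros(num_tokens)
manual_weights[3] = 1.0
manual_style = manual_weights @ tokens
```

In a full system, the resulting style vector would condition the synthesizer alongside the text, which is what would let style vary independently of the text content.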

Code & Models

Models

tts-hub/tacotron2_ddc_gst-zh-baker (Hugging Face model, 23 downloads)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU