Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

Eric Battenberg; Soroosh Mariooryad; Daisy Stanton; RJ Skerry-Ryan; Matt Shannon; David Kao; Tom Bagby

arXiv:1906.03402·cs.CL·October 29, 2019·43 cites

TL;DR

This paper proposes embedding capacity (the amount of information an embedding contains about the data) as a unified framework for analyzing and improving variational speech synthesis models, with the aim of better control over prosody and style transfer while preserving speaker identity.

Contribution

The paper proposes Capacitron, a model that explicitly constrains embedding capacity, enabling high-precision prosody and style transfer and a hierarchical decomposition of the latent variables.

Findings

01. Capacitron achieves high-precision prosody and style transfer.

02. The model preserves speaker identity during transfer.

03. Hierarchical capacity decomposition allows flexible control.

Abstract

Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style…
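The abstract describes constraining capacity through an upper bound on representational mutual information. A standard bound of this kind in variational models is the KL divergence between the posterior and the prior, averaged over the data. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes a diagonal Gaussian posterior, a standard normal prior, and a penalty of the form beta * (average KL - capacity limit), none of which is specified in the text above. The function names and parameters are hypothetical.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) in nats, one value per row.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def capacity_penalty(mu, log_var, capacity_nats, beta):
    """Return (average KL, penalty) for a batch of posterior parameters.

    The batch-average KL upper-bounds the mutual information between the
    data and the embedding. The penalty is positive when the average KL
    exceeds the capacity limit and negative when it falls below it.
    Assumption: beta is a fixed non-negative weight here; in a constrained
    optimization setup it would be adjusted during training.
    """
    avg_kl = float(np.mean(gaussian_kl_to_standard_normal(mu, log_var)))
    return avg_kl, beta * (avg_kl - capacity_nats)


if __name__ == "__main__":
    # Two hypothetical 2-dimensional posteriors with unit variance.
    mu = np.array([[1.0, 0.0], [0.0, 0.0]])
    log_var = np.zeros((2, 2))
    avg_kl, penalty = capacity_penalty(mu, log_var, capacity_nats=0.1, beta=2.0)
    print(f"average KL (nats): {avg_kl:.3f}, penalty: {penalty:.3f}")
```

Under this sketch, a posterior identical to the prior yields zero KL, meaning the embedding carries no information about the data; raising the capacity limit lets the embedding carry more.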

Code & Models

Repositories

jasminsternkopf/mel_cepstral_distance
none

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques