Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

TL;DR
This paper introduces a unified framework using embedding capacity to analyze and improve variational speech synthesis models, enabling better control over prosody, style transfer, and speaker identity preservation.
Contribution
The paper proposes the Capacitron model, which explicitly constrains embedding capacity, enabling high-precision style transfer and hierarchical latent variable decomposition.
Findings
Capacitron achieves high-precision prosody and style transfer.
The model preserves speaker identity during transfer.
Hierarchical capacity decomposition allows flexible control.
Abstract
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style…
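One way to picture "explicitly constraining capacity using an upper bound on representational mutual information" is the sketch below. It is an illustration under stated assumptions, not the paper's implementation: it assumes a diagonal Gaussian posterior and a standard normal prior (neither is specified in the abstract), and the function names, the `beta` weight, and the `capacity` target are hypothetical. The batch-averaged KL term between posterior and prior upper-bounds the mutual information between data and embedding, so penalizing its excess over a target limits how much information the embedding can carry.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) for each example, in nats."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def capacity_constrained_loss(recon_nll, mu, log_var, beta, capacity):
    """Reconstruction loss plus a penalty on KL in excess of a capacity target.

    recon_nll: scalar reconstruction negative log-likelihood (assumed given).
    beta:      non-negative weight on the constraint (hypothetical name).
    capacity:  target capacity in nats (hypothetical name).
    """
    # The batch mean of the KL upper-bounds the representational mutual information.
    kl = kl_to_standard_normal(mu, log_var).mean()
    loss = recon_nll + beta * (kl - capacity)
    return loss, kl

# Toy batch: 4 examples, 8-dimensional embedding, posterior mean of all ones.
mu = np.ones((4, 8))
log_var = np.zeros((4, 8))
loss, kl = capacity_constrained_loss(10.0, mu, log_var, beta=1.0, capacity=5.0)
```

In this toy batch the KL is 4 nats, below the 5-nat target, so the penalty term is negative and training would be free to encode more information in the embedding. With a KL above the target, the same term would push the embedding toward carrying less.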
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
