Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Junichi Yamagishi

TL;DR
This paper explores unsupervised learning methods for controllable speech synthesis using deep encoder-decoder models, connects these methods to variational autoencoders, and demonstrates their effectiveness in emotional speech synthesis without labeled data.
Contribution
It provides a new probabilistic interpretation of unsupervised control learning in speech synthesis and compares different autoencoder-based models for this task.
Findings
Unsupervised methods match or surpass supervised approaches in emotional speech synthesis.
Popular heuristics can be interpreted as variational inference in autoencoders.
VQ-VAEs can be derived from similar probabilistic principles.
Abstract
Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example, we show that popular unsupervised training heuristics can be interpreted as variational inference in certain autoencoder models. We additionally connect these models to VQ-VAEs, another, recently-proposed class of deep variational autoencoders, which we show can be derived from a very similar mathematical argument. The implications of these new probabilistic interpretations are discussed. We illustrate the utility of the various approaches with an application to acoustic modelling for emotional…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
