Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

Gustav Eje Henter; Jaime Lorenzo-Trueba; Xin Wang; Junichi Yamagishi

arXiv:1807.11470·eess.AS·September 11, 2018·51 cites

Open Access

TL;DR

This paper studies unsupervised learning of output control in speech synthesis using deep encoder-decoder models. It shows that popular training heuristics can be interpreted as variational inference in autoencoder models, relates these models to VQ-VAEs, and demonstrates the approaches on emotional speech synthesis without labeled data.

Contribution

It provides a new probabilistic interpretation of unsupervised control learning in speech synthesis and compares different autoencoder-based models for this task.

Findings

01

Unsupervised methods match or surpass supervised approaches in emotional speech synthesis.

02

Popular heuristics can be interpreted as variational inference in autoencoders.

03

VQ-VAEs can be derived from similar probabilistic principles.

Abstract

Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example, we show that popular unsupervised training heuristics can be interpreted as variational inference in certain autoencoder models. We additionally connect these models to VQ-VAEs, another recently proposed class of deep variational autoencoders, which we show can be derived from a very similar mathematical argument. The implications of these new probabilistic interpretations are discussed. We illustrate the utility of the various approaches with an application to acoustic modelling for emotional…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
