Hierarchical Generative Modeling for Controllable Speech Synthesis

Wei-Ning Hsu; Yu Zhang; Ron J. Weiss; Heiga Zen; Yonghui Wu; Yuxuan Wang; Yuan Cao; Ye Jia; Zhifeng Chen; Jonathan Shen; Patrick Nguyen; Ruoming Pang

arXiv:1810.07217 · cs.CL · December 31, 2018 · 45 cites

TL;DR

This paper introduces a hierarchical VAE-based neural TTS model that enables controllable speech synthesis by manipulating latent attributes like style, accent, and noise, even with limited annotations.

Contribution

It presents a novel hierarchical generative framework with interpretable categorical and fine-grained Gaussian latent variables for controllable speech synthesis.

Findings

1. Effective control over speech attributes demonstrated.
2. Model can infer speaker and style from noisy data.
3. High-quality synthesis with controllable attributes achieved.

Abstract

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation…
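The two-level latent structure described in the abstract can be illustrated with a minimal sketch of ancestral sampling from a Gaussian mixture prior: first draw the categorical variable, then draw the Gaussian variable conditioned on it. This is not the authors' implementation. The function names, the diagonal covariance, and the two-component, three-dimensional prior are all assumptions chosen for illustration, and the sketch leaves out the text conditioning and the speech decoder entirely.

```python
import random


def sample_latent(weights, means, stddevs, rng):
    """Ancestral sampling from a GMM prior (illustrative sketch only)."""
    # First level: the categorical variable y picks an attribute group,
    # e.g. a hypothetical "clean" vs. "noisy" component.
    y = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Second level: a Gaussian z conditioned on y. A diagonal covariance
    # is assumed here so each dimension is drawn independently.
    z = [rng.gauss(mu, sd) for mu, sd in zip(means[y], stddevs[y])]
    return y, z


def shift_dimension(means, y, dim, offset):
    """Return the mean of component y with one dimension shifted.

    Hypothetical helper: moving a single latent dimension while holding
    the others fixed is one way to picture fine-grained control.
    """
    z = list(means[y])
    z[dim] += offset
    return z


# Hypothetical prior: 2 components over a 3-dimensional latent space.
WEIGHTS = [0.5, 0.5]
MEANS = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
STDDEVS = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]

rng = random.Random(0)
y, z = sample_latent(WEIGHTS, MEANS, STDDEVS, rng)
controlled = shift_dimension(MEANS, 0, 0, 0.5)
```

In this sketch, choosing `y` directly instead of sampling it corresponds to selecting an attribute group, while `shift_dimension` changes one coordinate within that group.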

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
