Learning Robust Latent Representations for Controllable Speech Synthesis

Shakti Kumar; Jithin Pradeep; Hussain Zaidi

arXiv:2105.04458 · cs.SD · May 11, 2021

TL;DR

This paper introduces RTI-VAE, a novel Transformer-based VAE model that improves the learning of disentangled, controllable latent representations in speech synthesis, especially with limited or noisy data.

Contribution

The paper proposes RTI-VAE, a Transformer-based VAE with layer reordering and mutual information minimization to enhance controllability and disentanglement in speech synthesis.

Findings

01. RTI-VAE reduces speaker attribute cluster overlap by at least 30% compared to LSTM-VAE.

02. RTI-VAE outperforms a vanilla Transformer-VAE, reducing cluster overlap by at least 7%.

03. The approach improves controllability and robustness of speech synthesis models.

Abstract

State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on either limited or noisy datasets. Further, different latent variables start encoding the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer with Information reduction VAE) where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that RTI-VAE reduces the cluster overlap of speaker…
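The abstract's idea of penalizing shared information between latent variables can be illustrated with a small sketch. This is not the estimator used in the paper, which the summary here does not specify. It assumes the two latent variables are jointly Gaussian, in which case their mutual information has a closed form in terms of covariance determinants. The function name `gaussian_mutual_information` and the arrays `z_a`, `z_b`, and `z_entangled` are hypothetical names chosen for this example.

```python
import numpy as np


def gaussian_mutual_information(z1, z2, eps=1e-6):
    """Estimate I(z1; z2) in nats, ASSUMING (z1, z2) are jointly Gaussian.

    z1: array of shape (n_samples, d1)
    z2: array of shape (n_samples, d2)
    Uses I = 0.5 * (log det S1 + log det S2 - log det S_joint).
    This is an illustrative proxy, not the paper's estimator.
    """
    z1 = np.asarray(z1, dtype=np.float64)
    z2 = np.asarray(z2, dtype=np.float64)
    d1 = z1.shape[1]

    # Sample covariance of the concatenated latents; eps keeps it well conditioned.
    joint = np.concatenate([z1, z2], axis=1)
    cov = np.cov(joint, rowvar=False) + eps * np.eye(joint.shape[1])

    # slogdet returns (sign, log|det|); the covariances are positive definite here.
    _, logdet_joint = np.linalg.slogdet(cov)
    _, logdet_1 = np.linalg.slogdet(cov[:d1, :d1])
    _, logdet_2 = np.linalg.slogdet(cov[d1:, d1:])
    return 0.5 * (logdet_1 + logdet_2 - logdet_joint)


rng = np.random.default_rng(0)
z_a = rng.normal(size=(5000, 2))                      # one latent variable
z_b = rng.normal(size=(5000, 2))                      # an independent latent variable
z_entangled = z_a + 0.1 * rng.normal(size=(5000, 2))  # encodes the same features as z_a

print("independent latents:", gaussian_mutual_information(z_a, z_b))
print("entangled latents:  ", gaussian_mutual_information(z_a, z_entangled))
```

Under these assumptions the estimate stays close to zero for independent latents and grows when one latent largely duplicates another, which is the failure mode the abstract describes. A term of this kind could in principle be added to a VAE training loss as a penalty, though a differentiable implementation in a deep learning framework would be needed for that.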

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding