Learning Robust Latent Representations for Controllable Speech Synthesis
Shakti Kumar, Jithin Pradeep, Hussain Zaidi

TL;DR
This paper introduces RTI-VAE, a novel Transformer-based VAE model that improves the learning of disentangled, controllable latent representations in speech synthesis, especially with limited or noisy data.
Contribution
The paper proposes RTI-VAE, a Transformer-based VAE with layer reordering and mutual information minimization to enhance controllability and disentanglement in speech synthesis.
Findings
RTI-VAE reduces speaker attribute cluster overlap by at least 30% compared to LSTM-VAE.
RTI-VAE also reduces cluster overlap by at least 7% compared to a vanilla Transformer-VAE.
The approach improves controllability and robustness of speech synthesis models.
Abstract
State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features like pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs fail to learn latent clusters of speaker attributes when trained on either limited or noisy datasets. Further, different latent variables start encoding the same features, limiting the control and expressiveness during speech synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer with Information reduction VAE) where we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that RTI-VAE reduces the cluster overlap of speaker…
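The abstract's idea of minimizing mutual information between latent variables can be illustrated with a small numerical sketch. The summary does not say which mutual information estimator RTI-VAE uses, so the code below swaps in the closed-form expression for jointly Gaussian variables, I(z1; z2) = 0.5 (log det S1 + log det S2 - log det S), purely as an assumption. The function name `gaussian_mutual_information` and the arguments `z1`, `z2`, and `eps` are hypothetical and not taken from the paper.

```python
import numpy as np


def gaussian_mutual_information(z1, z2, eps=1e-6):
    """Estimate I(z1; z2) in nats from samples.

    Assumption (not from the paper): (z1, z2) is jointly Gaussian, so the
    mutual information has a closed form in terms of covariance determinants.
    z1 has shape (n_samples, d1); z2 has shape (n_samples, d2).
    """
    joint = np.concatenate([z1, z2], axis=1)
    d1 = z1.shape[1]
    # Sample covariance of the joint latent, with a small ridge for stability.
    cov = np.cov(joint, rowvar=False) + eps * np.eye(joint.shape[1])
    _, logdet_joint = np.linalg.slogdet(cov)
    _, logdet_1 = np.linalg.slogdet(cov[:d1, :d1])
    _, logdet_2 = np.linalg.slogdet(cov[d1:, d1:])
    return 0.5 * (logdet_1 + logdet_2 - logdet_joint)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_a = rng.normal(size=(5000, 2))
    z_independent = rng.normal(size=(5000, 2))
    z_entangled = z_a + 0.1 * rng.normal(size=(5000, 2))
    # Independent latents give an estimate near zero; entangled ones do not.
    print(gaussian_mutual_information(z_a, z_independent))
    print(gaussian_mutual_information(z_a, z_entangled))
```

An estimate near zero suggests the two latent variables encode different information, while a large value signals the kind of overlap the abstract describes. In an actual training setup, a differentiable version of such a term would presumably be added to the VAE loss with a weighting coefficient; that detail is also an assumption here.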
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding
