Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech

Marek Strong; Jonas Rohnke; Antonio Bonafonte; Mateusz Łajszczak; Trevor Wood

arXiv:2110.12539·cs.SD·September 15, 2023

Open Access

TL;DR

This paper introduces SVQ-VAE, a neural TTS architecture built around a split vector quantizer. It improves naturalness over VAE and VQ-VAE baselines and yields a latent acoustic space that can be predicted from text.

Contribution

The paper proposes SVQ-VAE, an architecture that extends VAE and VQ-VAE for neural TTS with a split vector quantizer. It retains an utterance-level bottleneck while keeping significant representation power in a small discretized latent space.

Findings

01

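SVQ-VAE achieves a statistically significant improvement in naturalness over VAE and VQ-VAE on recordings from the expressive task-oriented dialogues domain.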

02

Predicting the latent acoustic space from text reduces the gap between standard constant vector synthesis and vocoded recordings by 32%.

03

The discretized latent space is small enough to be predicted efficiently from text.

Abstract

We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vector quantizer for neural text-to-speech (NTTS), as an enhancement to the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck, while keeping significant representation power and a discretized latent space small enough for efficient prediction from text. We train the model on recordings in the expressive task-oriented dialogues domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between the standard constant vector synthesis and vocoded recordings by 32%.
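
The abstract does not spell out the quantizer, so the following is only a minimal sketch of generic split vector quantization, not the authors' implementation. It assumes the utterance-level latent is cut into equal-sized chunks and each chunk is snapped to the nearest entry of its own codebook. The names `nearest_code` and `split_quantize`, the sizes (`LATENT_DIM`, `NUM_SPLITS`, `CODES_PER_SPLIT`), and the random codebooks are all hypothetical and chosen for illustration.

```python
import random


def nearest_code(sub_vector, codebook):
    """Return the index of the codebook entry closest to sub_vector (squared L2)."""
    def sq_dist(code):
        return sum((x - c) ** 2 for x, c in zip(sub_vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def split_quantize(vector, codebooks):
    """Split `vector` into len(codebooks) equal chunks and quantize each chunk
    against its own codebook. Returns (indices, quantized_vector)."""
    num_splits = len(codebooks)
    if len(vector) % num_splits != 0:
        raise ValueError("vector length must be divisible by the number of splits")
    chunk = len(vector) // num_splits
    indices, quantized = [], []
    for s, codebook in enumerate(codebooks):
        sub = vector[s * chunk:(s + 1) * chunk]
        idx = nearest_code(sub, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Hypothetical sizes for illustration only (not the paper's settings).
rng = random.Random(0)
LATENT_DIM, NUM_SPLITS, CODES_PER_SPLIT = 8, 4, 16
codebooks = [
    [[rng.uniform(-1, 1) for _ in range(LATENT_DIM // NUM_SPLITS)]
     for _ in range(CODES_PER_SPLIT)]
    for _ in range(NUM_SPLITS)
]
utterance_latent = [rng.uniform(-1, 1) for _ in range(LATENT_DIM)]
indices, quantized = split_quantize(utterance_latent, codebooks)

# A text-side predictor would face NUM_SPLITS choices among CODES_PER_SPLIT
# classes each, while the codes jointly address this many distinct vectors:
num_representable = CODES_PER_SPLIT ** NUM_SPLITS
```

Under these assumed sizes, four 16-way choices address 16^4 = 65,536 distinct quantized vectors. That illustrates, in general terms, how splitting can keep each prediction target small while the combined code stays expressive, which is the trade-off the abstract describes.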

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: VQ-VAE