VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space

Armani Rodriguez; Silvija Kokalj-Filipovic

arXiv:2411.14642·cs.LG·November 25, 2024

Open Access

TL;DR

VQalAttent is a lightweight, interpretable speech generation model that pairs a VQ-VAE with a decoder-only transformer to generate fake speech with tunable performance, helping explain how such synthesis systems work and how to develop more advanced ones.

Contribution

The paper introduces VQalAttent, a novel, transparent pipeline combining VQ-VAE and transformer models for efficient, interpretable speech synthesis with limited computational resources.

Findings

01 Generates intelligible speech from discrete latent representations.

02 Provides insights into the relationship between latent space and speech quality.

03 Achieves high-quality speech synthesis with a modular, transparent architecture.

Abstract

Generating high-quality speech efficiently remains a key challenge for generative models in speech synthesis. This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability. Leveraging the AudioMNIST dataset, which consists of human utterances of decimal digits (0-9), our method employs a two-step architecture: first, a scalable vector quantized autoencoder (VQ-VAE) that compresses audio spectrograms into discrete latent representations, and second, a decoder-only transformer that learns the probability model of these latents. The trained transformer generates similar latent sequences, which the VQ-VAE decoder converts to audio spectrograms, from which we generate fake utterances. Interpreting the statistical and perceptual quality of the fakes, depending on the dimension and the extrinsic information of the latent space, enables…
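The quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `quantize` and `dequantize`, the array shapes, and the toy codebook are all assumptions chosen for illustration. It shows only the generic VQ-VAE idea of replacing each continuous latent vector with the index of its nearest codebook entry, which yields the discrete token sequence a transformer could then model.

```python
import numpy as np

def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest code.

    latents:  (T, D) array, standing in for encoder outputs (hypothetical shape)
    codebook: (K, D) array, standing in for learned code vectors
    """
    # Pairwise squared Euclidean distances via broadcasting: shape (T, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Nearest code per latent: shape (T,)
    return np.argmin(dists, axis=1)

def dequantize(indices, codebook):
    """Look up the code vectors for a sequence of discrete tokens."""
    return codebook[indices]

# Toy codebook (assumption): code k is the 4-dimensional vector [k, k, k, k]
codebook = np.stack([np.full(4, float(k)) for k in range(8)])

# Toy latents: known codes shifted by a small offset, so the nearest code is unambiguous
true_idx = np.array([3, 0, 7, 3, 5])
latents = codebook[true_idx] + 0.1

tokens = quantize(latents, codebook)   # discrete latent sequence
recon = dequantize(tokens, codebook)   # vectors that would be passed to a decoder
print(tokens)
```

In a pipeline of the kind the abstract describes, sequences like `tokens` would serve as training data for the transformer, and sequences sampled from the trained transformer would be mapped back through the codebook and the VQ-VAE decoder to spectrograms.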

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: VQ-VAE