VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space
Armani Rodriguez, Silvija Kokalj-Filipovic

TL;DR
VQalAttent is a lightweight, interpretable speech generation model that uses VQ-VAE and transformer architectures to produce high-quality, controllable fake speech efficiently, aiding understanding and development of advanced synthesis systems.
Contribution
The paper introduces VQalAttent, a novel, transparent pipeline combining VQ-VAE and transformer models for efficient, interpretable speech synthesis with limited computational resources.
Findings
Generates intelligible speech from discrete latent representations.
Provides insights into the relationship between latent space and speech quality.
Achieves high-quality speech synthesis with modular, transparent architecture.
Abstract
Generating high-quality speech efficiently remains a key challenge for generative models in speech synthesis. This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability. Leveraging the AudioMNIST dataset, consisting of human utterances of decimal digits (0-9), our method employs a two-step architecture: first, a scalable vector quantized autoencoder (VQ-VAE) that compresses audio spectrograms into discrete latent representations, and second, a decoder-only transformer that learns the probability model of these latents. The trained transformer generates similar latent sequences, which the VQ-VAE decoder converts to audio spectrograms, from which we generate fake utterances. Interpreting the statistical and perceptual quality of the fakes, depending on the dimension and the extrinsic information of the latent space, enables…
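To make the first step of the two-step architecture concrete, the following is a minimal sketch of the vector quantization lookup that a VQ-VAE performs, assuming plain NumPy arrays and a nearest-neighbour rule under squared Euclidean distance. The function name `quantize`, the toy 2-D codebook, and the example latents are hypothetical and chosen for illustration only; they are not taken from the paper, and this is not the authors' implementation.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  array of shape (T, D), one encoder output per position
    codebook: array of shape (K, D), the K learned code vectors
    Returns (indices, quantized): indices has shape (T,), quantized (T, D).
    """
    # Squared Euclidean distance between every latent and every code: (T, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # The discrete token for each position is the index of the closest code
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy codebook (illustrative assumption): the four corners of the unit square
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Three hypothetical encoder outputs, each lying near one corner
latents = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 1.1]])

indices, quantized = quantize(latents, codebook)
print(indices)  # [1 2 3]
```

In a pipeline of the kind the abstract describes, index sequences like `indices` would be the discrete tokens whose distribution the decoder-only transformer learns, and looking sampled indices up in the codebook would give the vectors passed to the VQ-VAE decoder.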
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: VQ-VAE
