Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

Xinyu Zhao; Nikita Karagodin; Hamed Hassani; Sinan Hersek; Paul Pu Liang; Yury Polyanskiy

arXiv:2605.06870 · cs.LG · May 13, 2026

TL;DR

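This paper identifies dimensional collapse, where trained VQ-VAE representations occupy only 1-2% of the full-rank latent space, as a bottleneck on reconstruction loss. It proposes training the model as an unquantized autoencoder first (AE Warm-Up) and introducing VQ afterwards, which restores representation dimension and improves reconstruction and perceptual quality.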

Contribution

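The paper introduces an analytic framework that extends the sequential learning effect of Saxe et al. [2014] with ideas from rate-distortion theory to explain dimensional collapse, and it demonstrates that an autoencoder warm-up phase mitigates the problem in VQ-VAEs.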

Findings

01 AE Warm-Up restores the representation dimension of VQ-VAEs.

02 AE Warm-Up increases the effective codebook dimension from 3-5 to 17-19.

03 AE Warm-Up reduces rFID (image reconstruction) by 17-35% and improves PESQ (audio quality) by 11-14%.
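
To make "effective codebook dimension" concrete, here is a minimal sketch of one common proxy, the participation ratio of the covariance spectrum. This summary does not say which metric the paper uses, so the choice of participation ratio, the function name `participation_ratio`, and the synthetic codebooks below are all assumptions made for illustration.

```python
import numpy as np


def participation_ratio(vectors: np.ndarray) -> float:
    """Effective dimension of a set of row vectors.

    Assumption: the participation ratio (sum of eigenvalues squared,
    divided by the sum of squared eigenvalues) is used here as a proxy;
    it is not necessarily the metric used in the paper.
    """
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Squared singular values are the covariance eigenvalues up to a
    # constant factor, which cancels in the ratio below.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values ** 2
    total = eigenvalues.sum()
    if total == 0.0:
        return 0.0
    return float(total ** 2 / (eigenvalues ** 2).sum())


rng = np.random.default_rng(0)

# Hypothetical collapsed codebook: 512 codes in 32 dimensions that all
# lie in a 3-dimensional subspace.
basis = rng.standard_normal((3, 32))
collapsed = rng.standard_normal((512, 3)) @ basis

# Hypothetical healthy codebook: codes spread over all 32 dimensions.
full = rng.standard_normal((512, 32))

collapsed_dim = participation_ratio(collapsed)
full_dim = participation_ratio(full)
print(f"collapsed: {collapsed_dim:.2f}, full: {full_dim:.2f}")
```

The participation ratio never exceeds the rank of the data, so the collapsed codebook scores at most 3 even though each code nominally has 32 coordinates.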

Abstract

While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimension collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up…
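
The warm-up idea described in the abstract can be sketched as a bottleneck that passes latents through unchanged for an initial number of steps and switches to nearest-neighbor quantization afterwards. This is only an illustration: the names `nearest_code`, `bottleneck`, and `warmup_steps`, the hard switch at a fixed step, and the random latents and codebook are assumptions, and details such as codebook initialization at the switch, gradient estimation, and the warm-up length are not given in this summary.

```python
import numpy as np


def nearest_code(latents: np.ndarray, codebook: np.ndarray):
    """Replace each latent vector with its nearest codebook entry."""
    # Squared Euclidean distance between every latent and every code.
    diffs = latents[:, None, :] - codebook[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)
    indices = distances.argmin(axis=1)
    return codebook[indices], indices


def bottleneck(latents, codebook, step, warmup_steps):
    """Hypothetical schedule: plain autoencoder first, VQ afterwards."""
    if step < warmup_steps:
        # Warm-up phase: no quantization, so no code indices either.
        return latents, None
    return nearest_code(latents, codebook)


rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 8))    # stand-in for encoder outputs
codebook = rng.standard_normal((16, 8))  # stand-in for learned codes

warm_out, warm_idx = bottleneck(latents, codebook, step=0, warmup_steps=1000)
vq_out, vq_idx = bottleneck(latents, codebook, step=1000, warmup_steps=1000)
```

During warm-up the decoder sees the continuous latents, so the encoder can spread variance across many directions before quantization is introduced.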
