Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
Xinyu Zhao, Nikita Karagodin, Hamed Hassani, Sinan Hersek, Paul Pu Liang, Yury Polyanskiy

TL;DR
This paper identifies dimensional collapse in VQ-VAE models as a key issue and proposes an autoencoder warm-up phase to improve representation capacity, leading to better reconstruction and perceptual quality.
Contribution
The paper introduces a theoretical framework that explains dimensional collapse and demonstrates that an autoencoder warm-up phase effectively mitigates it in VQ-VAEs.
Findings
An autoencoder warm-up phase restores the representation dimension in VQ-VAEs.
The warm-up increases the effective codebook dimension from 3-5 to 17-19.
The warm-up reduces rFID by 17-35% and improves PESQ by 11-14%.
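The summary does not say how "effective codebook dimension" is computed. As a rough illustration only, the sketch below uses the participation ratio of the covariance spectrum, a common proxy for how many directions carry variance. The function name `participation_ratio` and the choice of metric are assumptions here, not the paper's definition.

```python
import numpy as np

def participation_ratio(x):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    x has shape (n_vectors, dim), e.g. codebook entries or encoder latents.
    Assumption: this is one common effective-dimension proxy, not
    necessarily the metric used in the paper.
    """
    x = x - x.mean(axis=0, keepdims=True)          # centre the data
    cov = x.T @ x / (x.shape[0] - 1)               # sample covariance, (dim, dim)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard against tiny negatives
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(0)
full = rng.normal(size=(5000, 8))                          # variance in all 8 directions
low = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 8))  # confined to a 2-D subspace
print(participation_ratio(full), participation_ratio(low))
```

Vectors confined to a low-dimensional subspace score at most the dimension of that subspace, which is the kind of gap the reported numbers (3-5 versus 17-19) describe.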
Abstract
While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up…
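The warm-up idea described in the abstract can be sketched as a bottleneck that passes latents through unchanged for the first training steps and applies nearest-neighbour quantization afterwards. This is a minimal sketch, not the authors' implementation: the names `quantize`, `bottleneck`, and `warmup_steps` are hypothetical, and the encoder, decoder, losses, and any codebook initialization are omitted because the summary does not specify them.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (Euclidean distance)."""
    # Squared distances between every latent and every code: shape (n, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def bottleneck(z, codebook, step, warmup_steps):
    """Identity during warm-up (plain autoencoder), VQ once warm-up ends.

    warmup_steps is a hypothetical hyperparameter for this sketch.
    """
    if step < warmup_steps:
        return z, None                # unquantized autoencoder phase
    return quantize(z, codebook)      # VQ phase

z = np.array([[0.1, 0.0], [0.9, 1.0]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
warm, _ = bottleneck(z, codebook, step=0, warmup_steps=10)      # latents unchanged
quant, idx = bottleneck(z, codebook, step=10, warmup_steps=10)  # snapped to codes
```

In a real training loop the decoder would consume the bottleneck output in both phases, so the only thing that changes at the end of warm-up is whether the latents are quantized.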
