Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Shizhan Liu; Xinran Deng; Zhuoyi Yang; Jiayan Teng; Xiaotao Gu; Jie Tang

arXiv:2512.05394 · cs.CV · December 8, 2025

TL;DR

This paper analyzes the spectral properties of video VAE latent spaces and introduces regularizers to improve diffusion training, leading to faster convergence and better video generation quality.

Contribution

It identifies key spectral properties of video VAE latents and proposes lightweight regularizers to induce these properties, enhancing diffusion model performance.

Findings

01. 3x faster convergence in text-to-video generation

02. 10% improvement in video reward

03. Outperforms strong open-source VAEs in the reported experiments

Abstract

Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3× speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at…
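
The two spectral properties named in the abstract can be made concrete with a small measurement sketch. The code below is not the authors' analysis code. It is a minimal illustration that assumes a latent stored as a NumPy array of shape (C, T, H, W). The function names, the 0.25 frequency cutoff, and the choice to remove per-channel means are all assumptions made for this example.

```python
import numpy as np


def low_frequency_energy_fraction(latent, cutoff=0.25):
    """Share of spatio-temporal spectral energy at or below `cutoff`.

    latent: array of shape (C, T, H, W).
    cutoff: frequency radius in cycles per sample (Nyquist is 0.5 per axis).
            The default 0.25 is an arbitrary illustrative value.
    """
    # Remove each channel's mean so the DC term does not dominate the ratio.
    centered = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    # Power spectrum over (T, H, W), accumulated across channels.
    power = np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2
    power = power.sum(axis=0)
    # Radial frequency of every FFT bin.
    ft = np.fft.fftfreq(latent.shape[1])[:, None, None]
    fh = np.fft.fftfreq(latent.shape[2])[None, :, None]
    fw = np.fft.fftfreq(latent.shape[3])[None, None, :]
    radius = np.sqrt(ft**2 + fh**2 + fw**2)
    total = power.sum()
    if total == 0:
        return 0.0  # constant latent: no non-DC energy to distribute
    return float(power[radius <= cutoff].sum() / total)


def channel_eigenspectrum(latent):
    """Normalized eigenvalues of the channel covariance, largest first.

    latent: array of shape (C, T, H, W). Returns an array of length C that
    sums to 1; entry k is the variance share of the k-th channel mode.
    """
    channels = latent.shape[0]
    flat = latent.reshape(channels, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    cov = flat @ flat.T / flat.shape[1]
    # eigvalsh returns ascending eigenvalues of a symmetric matrix.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return eigvals / eigvals.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-in; a real check would use latents from an actual video VAE.
    latent = rng.standard_normal((16, 8, 32, 32))
    print("low-frequency energy fraction:", low_frequency_energy_fraction(latent))
    print("top-3 channel mode shares:", channel_eigenspectrum(latent)[:3])
```

Under these assumptions, a latent with the first property would show a low-frequency energy fraction well above that of white noise, and a latent with the second property would concentrate most of its variance share in the first few entries of the eigenspectrum.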

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neuroimaging Techniques and Applications · Domain Adaptation and Few-Shot Learning