CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan

TL;DR
This paper introduces CV-VAE, a novel continuous 3D video VAE whose latent space is compatible with image VAEs like Stable Diffusion, enabling efficient training and improved video generation quality.
Contribution
We propose a new latent space regularization technique that aligns the latent space of the video VAE with that of an image VAE, facilitating seamless integration with pre-trained models and enhancing video generation.
Findings
Enables existing video models to generate four times more frames with minimal fine-tuning
Achieves compatibility with existing image VAEs like Stable Diffusion
Demonstrates improved video generation quality and efficiency
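The "four times more frames" finding can be illustrated with simple frame accounting. The sketch below is not taken from the paper's code: the function names, the 64-frame clip, the 16-latent budget, and the assumption that each latent frame summarizes exactly 4 video frames are all hypothetical, chosen only to contrast uniform frame sampling with true temporal compression.

```python
def uniform_sample(num_frames, stride):
    """Frame indices kept when temporal 'compression' is plain uniform sampling."""
    return list(range(0, num_frames, stride))


def frames_covered_by_latents(num_latents, temporal_factor):
    """Video frames represented by `num_latents` latent frames.

    Simplifying assumption (hypothetical): each latent frame summarizes
    exactly `temporal_factor` consecutive video frames.
    """
    return num_latents * temporal_factor


clip_len = 64        # frames in the source clip (illustrative value)
latent_budget = 16   # latent frames the video model can process (illustrative)

# 2D image VAE: one latent per frame, so the clip is subsampled to fit the budget.
kept = uniform_sample(clip_len, clip_len // latent_budget)
dropped = clip_len - len(kept)

# 3D video VAE with an assumed temporal factor of 4: the same budget spans the clip.
covered = frames_covered_by_latents(latent_budget, temporal_factor=4)

print(f"uniform sampling: {len(kept)} frames kept, {dropped} dropped")
print(f"temporal compression: {covered} frames covered")
```

Under these assumptions the model's latent budget is unchanged, but the frames between the sampled indices are no longer discarded, which is the source of the smoother motion the abstract refers to.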
Abstract
Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. The temporal compression is simply realized by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering the compatibility with…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Diffusion
