CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Sijie Zhao; Yong Zhang; Xiaodong Cun; Shaoshu Yang; Muyao Niu; Xiaoyu Li; Wenbo Hu; Ying Shan

arXiv:2405.20279 · cs.CV · October 24, 2024 · 1 citation

TL;DR

This paper introduces CV-VAE, a continuous 3D video VAE whose latent space is compatible with that of existing image VAEs, such as the one used in Stable Diffusion, enabling efficient training and improved video generation quality.

Contribution

We propose a new latent space regularization technique that aligns the latent space of the video VAE with that of an image VAE, facilitating seamless integration with pre-trained models and enhancing video generation.

Findings

01

Enables training of video models with four times more frames

02

Achieves latent-space compatibility with existing image VAEs, such as the one used in Stable Diffusion

03

Demonstrates improved video generation quality and efficiency

Abstract

Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. In the latter case, temporal compression is realized simply by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering the compatibility with…
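As a rough illustration of why temporal compression by uniform frame sampling can produce unsmooth motion, the sketch below keeps every fourth frame of a toy clip and reports the gap between the frames that survive. The function name `uniform_frame_sample`, the 16-frame clip, and the stride of 4 are assumptions chosen for illustration; they are not taken from the paper or its repository.

```python
def uniform_frame_sample(frames, stride):
    """Keep every `stride`-th frame; all frames in between are discarded."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


# A toy "video": 16 frame indices stand in for real image tensors (assumption).
video = list(range(16))

# Illustrative stride of 4 (assumption, not a value stated in the abstract).
kept = uniform_frame_sample(video, stride=4)

# Gap between consecutive kept frames, measured in original frames.
gaps = [b - a for a, b in zip(kept, kept[1:])]

print("kept frames:", kept)
print("gaps between kept frames:", gaps)
```

Each kept frame is several original frames away from its neighbour, and nothing about the skipped frames is encoded. A 3D VAE that compresses along the time axis would instead see all input frames when producing each latent.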

Code & Models

Repositories

ailab-cvc/cv-vae
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Diffusion