MagicVideo: Efficient Video Generation With Latent Diffusion Models

Daquan Zhou; Weimin Wang; Hanshu Yan; Weiwei Lv; Yizhe Zhu; Jiashi Feng

arXiv:2211.11018 · cs.CV · May 12, 2023 · 63 cites

TL;DR

MagicVideo is an efficient text-to-video generation framework built on latent diffusion models. It synthesizes smooth video clips that match the text prompt while using around 64x fewer FLOPs than Video Diffusion Models (VDM), and it adapts a U-Net denoiser trained on image tasks to video data.

Contribution

It proposes a novel latent diffusion-based approach with a 3D U-Net design and new modules for efficient, high-quality video generation from text descriptions.

Findings

01. Generates smooth, high-quality videos aligned with text prompts

02. Requires around 64x fewer FLOPs than Video Diffusion Models (VDM)

03. Produces videos at 256x256 resolution on a single GPU

Abstract

We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are concordant with the given text descriptions. Thanks to a novel and efficient 3D U-Net design and to modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, requiring around 64x fewer FLOPs than Video Diffusion Models (VDM). Specifically, unlike existing works that directly train video models in the RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of videos' latent codes via a diffusion model. In addition, we introduce two new designs to adapt the U-Net denoiser trained on image tasks to video data: a frame-wise lightweight adaptor for the…
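To make the latent-space idea concrete, here is a minimal sketch of the noising step a diffusion model is trained to invert, applied to VAE latents rather than RGB pixels. It is an illustration, not the authors' implementation. Everything numeric is an assumption the abstract does not state: the 16-frame clip length, the 8x spatial downsampling, the 4 latent channels, and the linear noise schedule. The latent values are random stand-ins for the output of a real pre-trained VAE encoder.

```python
import math
import random


def latent_shape(frames, height, width, downsample=8, channels=4):
    """Shape of per-frame latents (downsample and channels are assumed values)."""
    return (frames, channels, height // downsample, width // downsample)


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM choice, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(latent, alpha_bar_t, rng):
    """Sample z_t from N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in latent]


# Hypothetical clip: 16 frames at the 256x256 resolution named in the abstract.
shape = latent_shape(frames=16, height=256, width=256)
num_latent_values = math.prod(shape)
num_pixel_values = 16 * 3 * 256 * 256

rng = random.Random(0)
# Random stand-in for latents produced by a pre-trained VAE encoder.
z0 = [rng.gauss(0.0, 1.0) for _ in range(num_latent_values)]

alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
zt = add_noise(z0, alpha_bars[499], rng)  # noised latents at a mid-schedule step
```

Under these assumed settings the denoiser operates on far fewer values per clip than an RGB-space model (`num_latent_values` versus `num_pixel_values`). That ratio is a property of this sketch only and is not a derivation of the 64x FLOPs figure reported in the abstract.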

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization

Methods: Max Pooling · Concatenated Skip Connection · Diffusion · Convolution · U-Net