LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

Yaohui Wang; Xinyuan Chen; Xin Ma; Shangchen Zhou; Ziqi Huang; Yi Wang; Ceyuan Yang; Yinan He; Jiashuo Yu; Peiqing Yang; Yuwei Guo; Tianxing Wu; Chenyang Si; Yuming Jiang; Cunjian Chen; Chen Change Loy; Bo Dai; Dahua Lin; Yu Qiao; Ziwei Liu

arXiv:2309.15103 · cs.CV · September 28, 2023 · 34 cites

TL;DR

LaVie introduces a cascaded latent diffusion framework for high-quality, temporally coherent text-to-video generation. It builds on a pre-trained text-to-image model, adds simple temporal self-attention with rotary positional encoding, and trains on a large, diverse dataset to achieve state-of-the-art results.

Contribution

The paper presents a cascaded latent diffusion approach for text-to-video synthesis that combines temporal self-attention, joint image-video fine-tuning, and a large dataset to improve the quality and diversity of generated videos.

Findings

01. Achieves state-of-the-art performance in text-to-video generation.

02. Effectively captures temporal correlations with simple self-attention mechanisms.

03. Demonstrates versatility in long video and personalized synthesis applications.

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays…
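The first insight in the abstract, temporal self-attention combined with rotary positional encoding, can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single attention head, a channels-last `(B, T, H, W, C)` layout, no residual connection or normalization, and hypothetical projection matrices `w_q`, `w_k`, `w_v`. It only shows the general idea that each spatial location is treated as a length-`T` sequence and that frame order enters through rotations of the queries and keys.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotate feature pairs by an angle proportional to the frame index.

    x: array of shape (..., T, D) with D even. `base` is the usual RoPE
    constant; its value here is an assumption, not taken from the paper.
    """
    T, D = x.shape[-2], x.shape[-1]
    half = D // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(video, w_q, w_k, w_v):
    """Single-head attention over the time axis only.

    video: (B, T, H, W, C); w_q, w_k, w_v: (C, C) hypothetical projections.
    """
    B, T, H, W, C = video.shape
    # Fold spatial positions into the batch: every pixel becomes a length-T sequence.
    seq = video.transpose(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
    q = rotary_embed(seq @ w_q)  # positions are injected into queries and keys only
    k = rotary_embed(seq @ w_k)
    v = seq @ w_v
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))  # (B*H*W, T, T)
    out = attn @ v
    # Restore the original (B, T, H, W, C) layout.
    return out.reshape(B, H, W, T, C).transpose(0, 3, 1, 2, 4)

rng = np.random.default_rng(0)
video = rng.standard_normal((2, 4, 3, 3, 8))  # toy latent: 4 frames, 3x3 spatial, 8 channels
weights = [rng.standard_normal((8, 8)) for _ in range(3)]
out = temporal_self_attention(video, *weights)
```

Because the rotation angle grows linearly with the frame index, the query-key score between two frames depends only on their temporal offset, which is the property that makes rotary encoding a natural fit for attention along the time axis.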

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging

Methods: Balanced Selection · Diffusion