LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

TL;DR
LaVie introduces a cascaded latent diffusion framework for high-quality, temporally coherent text-to-video generation. It builds on a pre-trained text-to-image model, adds simple temporal self-attention with rotary positional encoding, and uses a large, diverse dataset to achieve state-of-the-art results.
Contribution
The paper presents a cascaded latent diffusion approach for text-to-video synthesis, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. It incorporates temporal self-attention, joint image-video fine-tuning, and a large dataset to improve the quality and diversity of generated videos.
Findings
Achieves state-of-the-art performance in text-to-video generation.
Captures temporal correlations in video data with simple temporal self-attention coupled with rotary positional encoding.
Demonstrates versatility in long video and personalized synthesis applications.
Abstract
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays…
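The first insight in the abstract, temporal self-attention coupled with rotary positional encoding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`rope`, `temporal_self_attention`), the identity query/key/value projections, the single attention head, and the frequency base of 10000 are all assumptions chosen to keep the example short. It shows only the core idea that each spatial position attends across frames, and that the rotary encoding makes attention scores depend on the relative frame offset.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary positional encoding: rotate consecutive pairs of `vec`
    by angles proportional to the (frame) position `pos`."""
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_self_attention(video):
    """video[t][p] is the feature vector at spatial position p of frame t.
    Attention mixes information across frames only, separately for each p.
    Assumption: identity projections instead of learned Q/K/V weights."""
    num_frames = len(video)
    num_positions = len(video[0])
    dim = len(video[0][0])
    out = [[None] * num_positions for _ in range(num_frames)]
    for p in range(num_positions):
        values = [video[t][p] for t in range(num_frames)]
        # Queries and keys are the inputs rotated by their frame index.
        rotated = [rope(v, t) for t, v in enumerate(values)]
        for t in range(num_frames):
            scores = [dot(rotated[t], rotated[s]) / math.sqrt(dim)
                      for s in range(num_frames)]
            weights = softmax(scores)
            out[t][p] = [sum(w * values[s][k] for s, w in enumerate(weights))
                         for k in range(dim)]
    return out

# Toy input: 3 frames, 2 spatial positions, 4 features per position.
toy_video = [
    [[1.0, 0.0, 0.5, 0.2], [0.3, 0.1, 0.0, 1.0]],
    [[0.9, 0.1, 0.4, 0.3], [0.2, 0.2, 0.1, 0.9]],
    [[0.8, 0.2, 0.3, 0.4], [0.1, 0.3, 0.2, 0.8]],
]
result = temporal_self_attention(toy_video)
print(len(result), len(result[0]), len(result[0][0]))
```

The output has the same frames-by-positions-by-features layout as the input, so a layer like this can be slotted between spatial layers of an image model without changing tensor shapes, which is one plausible reason such a design pairs well with a pre-trained T2I backbone.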
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
Methods: Balanced Selection · Diffusion
