Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

TL;DR
This paper introduces Stable Video Diffusion, a latent video diffusion model for high-resolution text-to-video and image-to-video generation. It is trained in three stages (text-to-image pretraining, video pretraining, and high-quality video finetuning), and the resulting base model adapts to downstream tasks.
Contribution
It presents a unified training strategy for latent video diffusion models, emphasizing the importance of curated datasets and systematic training stages for high-quality video generation.
Findings
High-quality video generation depends on curated pretraining datasets.
Finetuning on high-quality data improves video synthesis performance.
The base model effectively supports downstream tasks like image-to-video and multi-view generation.
Abstract
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering…
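As a rough illustration of the staged recipe and the filtering step the abstract mentions, here is a minimal Python sketch. It is not the authors' pipeline: the `Clip` fields, the `curate` helper, and both thresholds are hypothetical names and values chosen only to show the idea of dropping uncaptioned, low-motion, or low-quality clips before video pretraining.

```python
from dataclasses import dataclass

# The three training stages named in the abstract, in order.
STAGES = (
    "text-to-image pretraining",
    "video pretraining",
    "high-quality video finetuning",
)


@dataclass
class Clip:
    path: str
    caption: str          # empty string means the clip has no caption
    motion_score: float   # hypothetical score: higher means more motion
    quality_score: float  # hypothetical score: higher means better visuals


def curate(clips, min_motion=0.5, min_quality=0.5):
    """Keep captioned clips whose (hypothetical) scores clear both thresholds."""
    return [
        c for c in clips
        if c.caption
        and c.motion_score >= min_motion
        and c.quality_score >= min_quality
    ]


# Toy data, invented for this example.
clips = [
    Clip("a.mp4", "a dog running", 0.9, 0.8),
    Clip("b.mp4", "", 0.9, 0.9),               # dropped: no caption
    Clip("c.mp4", "a still slide", 0.1, 0.9),  # dropped: too little motion
]
kept = curate(clips)
```

In this toy run only `a.mp4` survives the filter. In a real setup the scores would come from whatever captioning and filtering models the curation process uses, which the truncated abstract does not specify.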
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Diffusion · Balanced Selection
