Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann; Tim Dockhorn; Sumith Kulal; Daniel Mendelevitch; Maciej Kilian; Dominik Lorenz; Yam Levi; Zion English; Vikram Voleti; Adam Letts; Varun Jampani; Robin Rombach

arXiv:2311.15127·cs.CV·November 28, 2023·67 cites

TL;DR

This paper introduces Stable Video Diffusion, a latent video diffusion model for high-resolution text-to-video and image-to-video generation. It is trained in three stages (text-to-image pretraining, video pretraining, and high-quality video finetuning), and the resulting base model is shown to adapt to downstream tasks.

Contribution

It presents a unified training strategy for latent video diffusion models, emphasizing the importance of curated datasets and systematic training stages for high-quality video generation.

Findings

01 High-quality video generation depends on curated pretraining datasets.

02 Finetuning on high-quality data improves video synthesis performance.

03 The base model effectively supports downstream tasks like image-to-video and multi-view generation.

Abstract

We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering…
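
The staged recipe and the curation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Stage` and `Clip` types, the `curate` function, the two score fields, and the threshold values are all hypothetical names and numbers chosen for the example. Only the three stage names and the idea of captioning plus filtering come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str  # kind of data the stage trains on (paraphrased from the abstract)


# The three training stages named in the abstract, in order.
STAGES = (
    Stage("text-to-image pretraining", "images"),
    Stage("video pretraining", "large curated video dataset"),
    Stage("high-quality video finetuning", "small high-quality video dataset"),
)


@dataclass(frozen=True)
class Clip:
    caption: str            # empty string means the clip has no caption
    motion_score: float     # hypothetical per-clip score, higher means more motion
    aesthetic_score: float  # hypothetical per-clip score, higher means better looking


def curate(clips, min_motion=0.5, min_aesthetic=0.5):
    """Keep captioned clips that clear both thresholds (threshold values are assumptions)."""
    return [
        clip
        for clip in clips
        if clip.caption
        and clip.motion_score >= min_motion
        and clip.aesthetic_score >= min_aesthetic
    ]


if __name__ == "__main__":
    raw = [
        Clip("a dog runs on a beach", 0.9, 0.8),
        Clip("", 0.9, 0.9),              # dropped: no caption
        Clip("static title card", 0.1, 0.9),  # dropped: too little motion
    ]
    pretraining_set = curate(raw)
    for stage in STAGES:
        print(stage.name, "->", stage.data)
    print(len(pretraining_set), "clip(s) kept for video pretraining")
```

In this sketch, filtering runs once before the video pretraining stage; which signals are actually used for filtering, and at what thresholds, is not stated in the excerpt above.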

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Diffusion · Balanced Selection