NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

Shengming Yin; Chenfei Wu; Huan Yang; Jianfeng Wang; Xiaodong Wang; Minheng Ni; Zhengyuan Yang; Linjie Li; Shuguang Liu; Fan Yang; Jianlong Fu; Gong Ming; Lijuan Wang; Zicheng Liu; Houqiang Li; Nan Duan

arXiv:2303.12346 · cs.CV · March 23, 2023 · 1 citation

TL;DR

NUWA-XL introduces a "Diffusion over Diffusion" architecture for extremely long video generation. Its coarse-to-fine process generates all segments in parallel, which speeds up inference, and training directly on long videos narrows the training-inference gap while keeping the output coherent.

Contribution

The paper presents NUWA-XL, a novel Diffusion over Diffusion architecture that generates long videos in parallel, reducing both inference time and the training-inference gap compared with sequential, segment-by-segment methods.

Findings

01 Achieves high-quality long videos with global and local coherence.

02 Reduces inference time from 7.55 minutes to 26 seconds for 1024 frames.

03 Introduces FlintstonesHD, a new benchmark for long video generation.
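
To put the inference-time finding in perspective, the short calculation below converts both figures to seconds and derives the implied speedup and relative reduction. Only the two timings (7.55 minutes and 26 seconds) come from the findings above; the variable names are illustrative.

```python
# Timings quoted in the findings above, for generating 1024 frames.
baseline_seconds = 7.55 * 60   # sequential generation: 7.55 minutes
parallel_seconds = 26.0        # NUWA-XL parallel generation

# How many times faster the parallel approach is (roughly 17x).
speedup = baseline_seconds / parallel_seconds

# Fraction of the baseline inference time that is saved (roughly 94%).
reduction = 1.0 - parallel_seconds / baseline_seconds

print(f"speedup: {speedup:.1f}x, reduction: {reduction:.2%}")
```

These are ratios of the reported wall-clock times only; they say nothing about how the timings were measured or on what hardware.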

Abstract

In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to the gap between training on short videos and inferring long videos, and the sequential generation is inefficient. Instead, our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to directly train on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build FlintstonesHD dataset, a new…
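
The coarse-to-fine schedule described in the abstract can be illustrated with a minimal sketch that tracks only frame positions, not pixels. Everything here is an assumption made for illustration: the function names, the `frames_per_call` and `depth` parameters, and the even spacing of keyframes are hypothetical, and the two stand-in functions replace the actual global and local diffusion models, which the abstract does not specify in detail.

```python
from typing import List


def global_keyframes(total_frames: int, frames_per_call: int) -> List[int]:
    """Stand-in for the global model: evenly spaced keyframe positions."""
    step = (total_frames - 1) / (frames_per_call - 1)
    return [round(i * step) for i in range(frames_per_call)]


def local_fill(first: int, last: int, frames_per_call: int) -> List[int]:
    """Stand-in for a local model: positions between two known frames."""
    if last - first < frames_per_call:
        # The gap is already small enough to fill every position.
        return list(range(first, last + 1))
    step = (last - first) / (frames_per_call - 1)
    return [round(first + i * step) for i in range(frames_per_call)]


def coarse_to_fine(total_frames: int, frames_per_call: int, depth: int) -> List[List[int]]:
    """Return the frame positions known after each level of refinement."""
    frames = global_keyframes(total_frames, frames_per_call)
    levels = [frames]
    for _ in range(depth):
        # Each adjacent pair is an independent job, so a real system
        # could run all of these local calls in parallel.
        segments = [local_fill(a, b, frames_per_call)
                    for a, b in zip(frames, frames[1:])]
        frames = sorted({pos for seg in segments for pos in seg})
        levels.append(frames)
    return levels


levels = coarse_to_fine(total_frames=65, frames_per_call=5, depth=2)
for depth_index, positions in enumerate(levels):
    print(depth_index, len(positions))
```

In this toy setting the number of known frames grows geometrically with depth, while the work at each level consists of independent segment jobs. That independence is what makes parallel generation possible, in contrast to sequential methods where each segment must wait for the previous one.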

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Video Coding and Compression Technologies

Methods: Diffusion