Towards Chunk-Wise Generation for Long Videos
Siyang Zhang, Ser-Nam Lim

TL;DR
This paper explores chunk-wise autoregressive methods for generating long videos, addressing memory constraints and inter-chunk consistency, and proposes a $k$-step search solution to improve long video synthesis.
Contribution
It provides a detailed survey of chunk-wise long video generation and introduces an efficient $k$-step search method to enhance inter-chunk coherence.
Findings
Chunk-wise autoregressive generation reduces memory load for long videos.
The $k$-step search improves consistency between video chunks.
The survey highlights challenges and solutions in long video synthesis.
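This summary does not spell out how the $k$-step search works, so the following is only a toy sketch of one plausible reading. It draws several candidate initial noises for a chunk, denoises each for just $k$ steps, keeps the candidate that scores best against the previous chunk, and finishes denoising that one alone. The names `denoise_step`, `consistency_score`, and `k_step_search` are hypothetical, as are all parameter values, and the "denoiser" is a trivial stand-in rather than a diffusion model.

```python
import random

def denoise_step(x, cond, rate=0.2):
    # Hypothetical denoising step: pull every value a fraction toward the
    # conditioning value (stand-in for one real diffusion step).
    return [xi + rate * (cond - xi) for xi in x]

def consistency_score(x, cond):
    # Hypothetical inter-chunk score: negative mean squared distance to the
    # conditioning value, so higher means more consistent.
    return -sum((xi - cond) ** 2 for xi in x) / len(x)

def k_step_search(cond, num_candidates=4, k=3, total_steps=20, chunk_len=8, seed=0):
    rng = random.Random(seed)
    steps_used = 0
    best_score, best_partial = None, None
    for _ in range(num_candidates):
        # Each candidate starts from its own random noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]
        # Denoise for only k steps, then evaluate the partial result.
        for _ in range(k):
            x = denoise_step(x, cond)
            steps_used += 1
        s = consistency_score(x, cond)
        if best_score is None or s > best_score:
            best_score, best_partial = s, x
    # Finish the remaining steps for the winning candidate only.
    x = best_partial
    for _ in range(total_steps - k):
        x = denoise_step(x, cond)
        steps_used += 1
    return x, steps_used

chunk, steps_used = k_step_search(cond=0.5)
# Cost: num_candidates * k + (total_steps - k) = 4 * 3 + 17 = 29 denoising steps,
# versus 4 * 20 = 80 if every candidate were denoised to completion.
```

In this sketch the saving comes from scoring candidates early: only the selected candidate pays for the full denoising schedule.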
Abstract
Generating long-duration videos has always been a significant challenge due to the inherent complexity of the spatio-temporal domain and the substantial GPU memory required to compute very large tensors. While diffusion-based generative models achieve state-of-the-art performance in video generation, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with a specific resolution and length must be specified first, and the model then denoises the entire video tensor at once, all frames together. This approach easily runs out of memory (OOM) when the specified resolution and/or length exceeds a certain limit. One solution to this problem is to generate many short video chunks autoregressively with strong inter-chunk spatio-temporal relation and then concatenate them together…
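The chunk-wise idea in the abstract can be illustrated with a minimal sketch. It assumes each chunk is conditioned on the last frame of the previous chunk, which is one common way to link chunks and is not confirmed by the text above. `generate_chunk` and `generate_long_video` are hypothetical names, and each "frame" is a single float so the example stays self-contained.

```python
import random

def generate_chunk(cond_frame, chunk_len, rng):
    # Hypothetical stand-in for a short-video generator: produces chunk_len
    # frames that drift smoothly away from the conditioning frame.
    frames = []
    prev = cond_frame
    for _ in range(chunk_len):
        prev = prev + rng.gauss(0.0, 0.1)
        frames.append(prev)
    return frames

def generate_long_video(first_frame, num_chunks, chunk_len, seed=0):
    rng = random.Random(seed)
    video = []
    cond = first_frame
    for _ in range(num_chunks):
        # Only one chunk is generated at a time, so the generator never has
        # to hold more than chunk_len frames in its working tensor.
        chunk = generate_chunk(cond, chunk_len, rng)
        video.extend(chunk)   # concatenate chunks into the long video
        cond = chunk[-1]      # assumed link: last frame conditions next chunk
    return video

video = generate_long_video(first_frame=0.0, num_chunks=4, chunk_len=16)
# The result has num_chunks * chunk_len = 64 frames.
```

Under these assumptions the generator's working set is bounded by the chunk length rather than the total video length, which is the memory argument the abstract makes.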
Taxonomy
Topics: Image and Video Quality Assessment · Video Coding and Compression Technologies · Advanced Image Processing Techniques
Methods: Diffusion
