StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

Shangjin Zhai; Zhichao Ye; Jialin Liu; Weijian Xie; Jiaqi Hu; Zhen Peng; Hua Xue; Danpeng Chen; Xiaomeng Wang; Lei Yang; Nan Wang; Haomin Liu; Guofeng Zhang

arXiv:2501.05763 · cs.CV · April 15, 2025

Open Access

TL;DR

StarGen introduces a novel autoregressive framework utilizing a video diffusion model for scalable, long-range, and controllable scene generation with improved consistency and pose accuracy.

Contribution

StarGen applies a pre-trained video diffusion model autoregressively to long-range scene generation, improving scalability and control.

Findings

01 Outperforms state-of-the-art methods in scalability and fidelity.

02 Achieves high pose accuracy in scene generation.

03 Supports diverse tasks such as view interpolation and city layout generation.

Abstract

Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and…
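The clip-by-clip scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_clip` stands in for the video diffusion model, `nearest_frames` is a hypothetical way to pick "spatially adjacent" images (the real method warps them in 3D, which is omitted here), and poses are reduced to camera positions. The one-frame overlap between consecutive clips and the default values of `clip_len` and `k` are assumptions as well.

```python
from typing import Callable, List, Sequence, Tuple

Pose = Tuple[float, float, float]  # assumption: camera position only, no rotation
Frame = str                        # placeholder for an image


def nearest_frames(target: Pose, poses: Sequence[Pose],
                   frames: Sequence[Frame], k: int) -> List[Frame]:
    """Return up to k already-generated frames whose poses are closest to target.

    frames[i] is assumed to have been generated at poses[i].
    """
    ranked = sorted(
        range(len(frames)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(poses[i], target)),
    )
    return [frames[i] for i in ranked[:k]]


def generate_scene(
    poses: Sequence[Pose],
    first_frame: Frame,
    generate_clip: Callable[[Sequence[Pose], Frame, List[Frame]], List[Frame]],
    clip_len: int = 4,
    k: int = 2,
) -> List[Frame]:
    """Generate one frame per pose, clip by clip, with a one-frame overlap."""
    if clip_len < 2:
        raise ValueError("clip_len must be at least 2 so each clip adds a frame")
    frames: List[Frame] = [first_frame]
    start = 0
    while start < len(poses) - 1:
        end = min(start + clip_len, len(poses))
        clip_poses = poses[start:end]
        # Temporal condition: the last frame of the previous clip.
        temporal_cond = frames[start]
        # Spatial condition: earlier frames near the end of the new clip.
        spatial_cond = nearest_frames(clip_poses[-1], poses, frames, k)
        clip = generate_clip(clip_poses, temporal_cond, spatial_cond)
        # clip[0] coincides with temporal_cond, so only the rest is new.
        frames.extend(clip[1:])
        start = end - 1
    return frames


def fake_clip(clip_poses: Sequence[Pose], temporal_cond: Frame,
              spatial_cond: List[Frame]) -> List[Frame]:
    """Stand-in generator that labels each frame by its x coordinate."""
    return [temporal_cond] + [f"f{int(p[0])}" for p in clip_poses[1:]]


demo_poses = [(float(i), 0.0, 0.0) for i in range(10)]
demo_frames = generate_scene(demo_poses, "f0", fake_clip)
```

With the stand-in generator, `demo_frames` holds one frame per pose. In a real system each clip would come from sampling the diffusion model, and the choice of which earlier frames to warp would depend on the actual camera geometry.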

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation

Methods: Diffusion · Contrastive Language-Image Pre-training