WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang

TL;DR
WorldWarp couples 3D geometric grounding with a 2D diffusion model to generate long-range, geometrically consistent video, targeting the occluded regions and complex camera trajectories that current methods struggle with.
Contribution
WorldWarp couples an online 3D geometric cache, built via Gaussian Splatting (3DGS), with a Spatio-Temporal Diffusion (ST-Diff) model that addresses the holes and artifacts left by warping.
Findings
Achieves state-of-the-art fidelity in 3D-consistent video generation.
Effectively handles occlusions and complex camera trajectories.
Maintains geometric consistency across video chunks.
Abstract
Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model…
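The warping step described in the abstract, and the holes it leaves behind, can be illustrated with a minimal sketch. This is not WorldWarp's 3DGS cache: it swaps in a plain depth-based point splat in NumPy, assuming a pinhole camera with intrinsics shared by both views. The names `forward_warp`, `K`, and `T_src_to_tgt` are hypothetical and introduced only for illustration.

```python
import numpy as np

def forward_warp(image, depth, K, T_src_to_tgt):
    """Splat source pixels into a target view (illustrative, not the paper's method).

    image: (h, w, c) array; depth: (h, w) array of positive depths;
    K: (3, 3) pinhole intrinsics, assumed identical for both views;
    T_src_to_tgt: (4, 4) rigid transform from source to target camera frame.
    Returns the warped image and a boolean mask of unfilled (hole) pixels.
    """
    h, w = depth.shape
    n = h * w
    # Homogeneous pixel coordinates, shape (3, n).
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(n)]).astype(np.float64)
    # Back-project each pixel to a 3D point in the source camera frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Move the points into the target camera frame.
    pts_tgt = (T_src_to_tgt @ np.vstack([pts_src, np.ones((1, n))]))[:3]
    z = pts_tgt[2]
    # Project into the target image and round to the nearest pixel.
    proj = K @ pts_tgt
    z_safe = np.where(z > 1e-6, z, 1.0)
    u_t = np.round(proj[0] / z_safe).astype(np.int64)
    v_t = np.round(proj[1] / z_safe).astype(np.int64)
    inside = (z > 1e-6) & (u_t >= 0) & (u_t < w) & (v_t >= 0) & (v_t < h)
    idx = np.flatnonzero(inside)
    # Z-buffer: keep only the nearest point that lands on each target pixel.
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (v_t[idx], u_t[idx]), z[idx])
    win = idx[z[idx] == zbuf[v_t[idx], u_t[idx]]]
    warped = np.zeros_like(image)
    warped[v_t[win], u_t[win]] = image.reshape(n, -1)[win]
    # Any target pixel that received no source point is a hole.
    holes = np.ones((h, w), dtype=bool)
    holes[v_t[win], u_t[win]] = False
    return warped, holes
```

In this sketch the `holes` mask marks the disoccluded regions that pure warping cannot fill. Those are the regions the abstract says static warping leaves behind and that the ST-Diff model is meant to address.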
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Advanced Vision and Imaging
