Beyond Pixel Histories: World Models with Persistent 3D State
Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian

TL;DR
The paper introduces PERSIST, a world model that simulates a latent 3D scene to generate consistent, long-horizon video with persistent spatial memory, enabling more realistic and controllable 3D environment synthesis.
Contribution
It presents a novel 3D-aware world model that maintains persistent spatial memory, improving 3D consistency and enabling environment editing from a single image.
Findings
Substantial improvements in spatial memory and 3D consistency over existing methods, in both quantitative metrics and a qualitative user study.
Enhanced long-horizon stability in generated videos.
Ability to synthesize diverse 3D environments from a single image.
Abstract
Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model that simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling…
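To make the idea of persistent spatial memory concrete, the toy sketch below separates a scene, a camera, and a renderer, mirroring the three components the abstract names. It is not the PERSIST architecture: the 1D corridor, the random "content" values, and the names `Camera`, `PersistentScene`, `render`, and `rollout` are all hypothetical simplifications chosen for illustration. The only point it demonstrates is that when frames are rendered from a stored scene state, returning to an earlier viewpoint reproduces the earlier frame regardless of how many steps have passed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Camera:
    # Toy stand-in for a 3D camera pose: a position along a 1D corridor.
    x: int = 0

    def step(self, action: int) -> None:
        # action is -1 (move left), 0 (stay), or +1 (move right).
        self.x += action


@dataclass
class PersistentScene:
    # Toy stand-in for a latent 3D scene: maps a world cell to its content.
    cells: dict = field(default_factory=dict)

    def content_at(self, x: int, generate) -> int:
        # Generate content the first time a cell is observed, then reuse it.
        if x not in self.cells:
            self.cells[x] = generate(x)
        return self.cells[x]


def render(scene: PersistentScene, camera: Camera, generate, fov: int = 1) -> list:
    # Toy "renderer": read the cells inside the camera's field of view.
    return [scene.content_at(camera.x + d, generate) for d in range(-fov, fov + 1)]


def rollout(actions, seed: int = 0) -> list:
    # Apply user actions one by one and render a frame after each step.
    rng = random.Random(seed)

    def generate(_x: int) -> int:
        # Hypothetical generator: samples new content for an unseen cell.
        return rng.randrange(256)

    scene, camera = PersistentScene(), Camera()
    frames = []
    for action in actions:
        camera.step(action)
        frames.append((camera.x, render(scene, camera, generate)))
    return frames


# Walk three cells to the right, then walk back to the starting cell.
frames = rollout([0, 1, 1, 1, -1, -1, -1])
```

Here the first and last frames are rendered from the same position and are identical, because the scene content is stored rather than regenerated. A model that conditions only on a short window of past frames would have no such guarantee once the original view has left its context.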
Taxonomy
Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
