Beyond Pixel Histories: World Models with Persistent 3D State

Samuel Garcin; Thomas Walker; Steven McDonagh; Tim Pearce; Hakan Bilen; Tianyu He; Kaixin Wang; Jiang Bian

arXiv:2603.03482·cs.CV·March 5, 2026

Open Access

TL;DR

The paper introduces PERSIST, a world model that simulates a latent 3D scene to generate consistent, long-term video with spatial memory, enabling more realistic and controllable 3D environment synthesis.

Contribution

It presents a novel 3D-aware world model that maintains persistent spatial memory, improving 3D consistency and enabling environment editing from a single image.

Findings

01. Significant improvements in spatial memory and 3D consistency over existing methods.

02. Enhanced long-horizon stability in generated videos.

03. Ability to synthesize diverse 3D environments from a single image.

Abstract

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model that simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling…
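
The environment / camera / renderer split described in the abstract can be illustrated with a toy sketch. This is not the PERSIST architecture: the class `ToyWorld`, its fields, the ring of labelled cells standing in for a latent 3D scene, and the static environment are all assumptions made for illustration only. The point it shows is that when frames are rendered from a stored scene state rather than from a window of past frames, revisiting a viewpoint reproduces the same view no matter how long the camera was away.

```python
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical persistent scene state (illustrative, not the paper's model)."""

    cells: list      # "environment": a ring of labelled cells, a stand-in for a 3D scene
    camera: int = 0  # "camera": index of the cell the viewer is centred on

    def step(self, action: int) -> None:
        # action is -1 (turn left) or +1 (turn right); the camera wraps around the ring.
        # The environment is kept static here, which is a simplifying assumption.
        self.camera = (self.camera + action) % len(self.cells)

    def render(self, fov: int = 1) -> list:
        # "renderer": read the cells visible from the current camera position.
        n = len(self.cells)
        return [self.cells[(self.camera + d) % n] for d in range(-fov, fov + 1)]


world = ToyWorld(cells=list("ABCDEFGH"))
first_view = world.render()

# Look away for several steps, then return to the starting viewpoint.
for action in [1] * 5 + [-1] * 5:
    world.step(action)

# The view is recovered from the stored state, not from remembered frames.
print(world.render() == first_view)
```

A model that conditions only on a short window of recent frames has no such stored state to read from once the original view has left the window, which is the limitation the abstract attributes to existing methods.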

Taxonomy

Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging