INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team (Alphabetical Order): Donghui Shen; Guofeng Zhang; Haomin Liu; Haoyu Ji; Hujun Bao; Hongjia Zhai; Jialin Liu; Jing Guo; Nan Wang; Siji Pan; Weihong Pan; Weijian Xie; Xianbin Liu; Xiaojun Xiang; Xiaoyu Zhang; Xinyu Chen; Yifu Wang; Yipeng Chen; Zhenzhou Fan; Zhewen Le; Zhichao Ye; Ziqiang Zhao

arXiv:2604.07209 · cs.CV · April 14, 2026

TL;DR

INSPATIO-WORLD is a real-time 4D world simulator that uses a Spatiotemporal Autoregressive (STAR) model to recover and generate high-fidelity, dynamic, interactive scenes from a single reference video, targeting better spatial consistency and visual realism.

Contribution

The paper introduces INSPATIO-WORLD, a framework that pairs the STAR architecture with the JDMD training method to generate realistic, controllable, and spatially consistent 4D scenes from a single reference video.

Findings

01. Outperforms state-of-the-art models in spatial consistency and interaction accuracy.

02. Ranks first on the WorldScore-Dynamic benchmark among real-time interactive methods.

03. Uses real-world data distribution regularization to improve visual fidelity.

Abstract

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint…

Code & Models

Repositories

inspatio/inspatio-world (GitHub)
