Simulating the Visual World with Artificial Intelligence: A Roadmap

Jingtong Yue; Ziqi Huang; Zhaoxi Chen; Xintao Wang; Pengfei Wan; Ziwei Liu

arXiv:2511.08585 · cs.AI · February 9, 2026

TL;DR

This paper surveys the evolution of video foundation models that combine implicit world modeling with video rendering to create physically plausible, interactive virtual environments for applications like robotics and gaming.

Contribution

It provides a systematic overview of the progression of video generation models into integrated world models with physical and interaction capabilities.

Findings

01 Four generations of video generation models are identified and characterized.

02 Modern models incorporate physical laws, interaction dynamics, and agent behavior.

03 Applications span robotics, autonomous driving, and interactive gaming.

Abstract

The landscape of video generation is shifting from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models: models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term…

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Social Robot Interaction and HRI