From Masks to Worlds: A Hitchhiker's Guide to World Models

Jinbin Bai; Yu Lei; Hecong Wu; Yuchen Zhu; Shufan Li; Yi Xin; Xiangtai Li; Molei Tao; Aditya Grover; Ming-Hsuan Yang

arXiv:2510.20668·cs.LG·October 24, 2025

Open Access

TL;DR

This paper guides readers through the evolution of world models, emphasizing the core generative, interactive, and memory components as the most promising path towards building true world models.

Contribution

It provides a focused overview of the development of world models, highlighting key architectural paradigms and their progression towards more integrated systems.

Findings

01. Unified architectures share a common paradigm

02. Interactive generative models close the action-perception loop

03. Memory-augmented systems sustain consistent worlds over time

Abstract

This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a "world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We show that this is the most promising path towards true world models.

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Embodied and Extended Cognition · Human Motion and Animation