What Does it Mean for a Neural Network to Learn a "World Model"?

Kenneth Li; Fernanda Viégas; Martin Wattenberg

arXiv:2507.21513·cs.AI·July 30, 2025

TL;DR

This paper proposes precise criteria for saying that a neural network learns and uses a "world model." It focuses on representing a latent state space of the world and adds conditions to check that such a model is not a trivial consequence of the network's data or task.

Contribution

It gives an operational definition of a neural network "world model," building on the linear probing literature: a computation that factors through a representation of the data generation process. The aim is a common language for experimental investigation.

Findings

01. Provides a set of criteria to identify genuine world models in neural networks

02. Formalizes the notion of a computation factoring through a representation of the data generation process

03. Includes conditions to distinguish meaningful world models from trivial ones

Abstract

We propose a set of precise criteria for saying a neural net learns and uses a "world model." The goal is to give an operational meaning to terms that are often used informally, in order to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent "state space" of the world, leaving modeling the effect of actions to future work. Our definition is based on ideas from the linear probing literature, and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions to check that such a "world model" is not a trivial consequence of the neural net's data or task.
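
To make the reference to linear probing concrete, here is a minimal sketch of what a linear probe does in general. It is not the paper's definition, and it does not implement the paper's non-triviality conditions. Everything in it is an assumption made for illustration: the synthetic latent `states`, the random `mixing` matrix standing in for a network's hidden layer, the noise level, and the helper names `fit_linear_probe` and `probe_r2`. The idea is simply that if a latent state is linearly encoded in activations, a least-squares map should recover it on held-out data, while activations unrelated to the state should not.

```python
import numpy as np


def fit_linear_probe(activations, states):
    """Fit a least-squares linear map (with bias) from activations to states."""
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, states, rcond=None)
    return W


def probe_r2(W, activations, states):
    """Coefficient of determination of the probe's predictions."""
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    pred = X @ W
    ss_res = ((states - pred) ** 2).sum()
    ss_tot = ((states - states.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins (assumptions, not data from the paper).
rng = np.random.default_rng(0)
n, state_dim, hidden_dim = 2000, 3, 16
states = rng.normal(size=(n, state_dim))            # hypothetical latent world state
mixing = rng.normal(size=(state_dim, hidden_dim))   # hypothetical linear encoding
activations = states @ mixing + 0.05 * rng.normal(size=(n, hidden_dim))
control = rng.normal(size=(n, hidden_dim))          # activations unrelated to the state

# Fit on the first 1500 samples, evaluate on the rest.
train, test = slice(0, 1500), slice(1500, None)
r2_encoded = probe_r2(
    fit_linear_probe(activations[train], states[train]),
    activations[test], states[test],
)
r2_control = probe_r2(
    fit_linear_probe(control[train], states[train]),
    control[test], states[test],
)
print(f"held-out R^2, state-encoding activations: {r2_encoded:.3f}")
print(f"held-out R^2, unrelated activations:      {r2_control:.3f}")
```

A high held-out score for the encoded activations and a near-zero score for the control only shows that the state is linearly decodable in this toy setup. Per the abstract, the paper's criteria ask for more than decodability, including checks that the representation is not a trivial consequence of the data or task.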
