What Does it Mean for a Neural Network to Learn a "World Model"?
Kenneth Li, Fernanda Viégas, Martin Wattenberg

TL;DR
This paper proposes precise criteria for when a neural network can be said to learn and use a "world model," focusing on the representation of a latent state space of the world and on ruling out cases where the apparent model is a trivial consequence of the network's data or task.
Contribution
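It introduces an operational, formal definition of a neural network "world model," built on ideas from the linear probing literature: a computation that factors through a representation of the data generation process. The definition is meant to give experimental work a common language.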
Findings
Provides a set of criteria to identify genuine world models in neural networks
Formalizes the notion of a computation that factors through a representation of the data generation process
Includes conditions to distinguish meaningful models from trivial ones
Abstract
We propose a set of precise criteria for saying a neural net learns and uses a "world model." The goal is to give an operational meaning to terms that are often used informally, in order to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent "state space" of the world, leaving modeling the effect of actions to future work. Our definition is based on ideas from the linear probing literature, and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions to check that such a "world model" is not a trivial consequence of the neural net's data or task.
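The abstract grounds its definition in linear probing. As a rough illustration of that general technique only (not the paper's formal criteria), the sketch below fits a linear probe to decode a latent state from hidden activations and compares it with a control probe trained on labels drawn independently of the state. Everything here is an assumption made for illustration: the activations are synthetic, the encoding direction is hypothetical, and the function names (`make_activations`, `train_probe`, `run_probe_experiment`) are invented.

```python
import math
import random


def make_activations(n, dim=8, noise=0.5, seed=0):
    """Synthetic stand-in for hidden activations that linearly encode a binary latent state."""
    rng = random.Random(seed)
    # Hypothetical encoding direction; a real study would read activations
    # from a trained network instead of generating them.
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    activations, states = [], []
    for _ in range(n):
        z = rng.randint(0, 1)  # latent world state
        sign = 1.0 if z == 1 else -1.0
        activations.append([sign * d + rng.gauss(0.0, noise) for d in direction])
        states.append(z)
    return activations, states


def train_probe(xs, ys, epochs=20, lr=0.1):
    """Fit a logistic-regression probe (weights, bias) with plain SGD."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            logit = max(-30.0, min(30.0, logit))  # clamp to keep exp() in range
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(probe, xs, ys):
    """Fraction of examples where the probe's thresholded output matches the label."""
    w, b = probe
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(ys)


def run_probe_experiment(seed=0):
    """Return (probe accuracy, control accuracy) on held-out synthetic data."""
    train_x, train_z = make_activations(400, seed=seed)
    test_x, test_z = make_activations(200, seed=seed + 1)
    probe_acc = accuracy(train_probe(train_x, train_z), test_x, test_z)

    # Control: labels drawn independently of the latent state, so a linear
    # probe should have nothing systematic to decode.
    rng = random.Random(seed + 2)
    ctrl_train = [rng.randint(0, 1) for _ in train_z]
    ctrl_test = [rng.randint(0, 1) for _ in test_z]
    control_acc = accuracy(train_probe(train_x, ctrl_train), test_x, ctrl_test)
    return probe_acc, control_acc
```

Calling `run_probe_experiment()` returns the held-out accuracy of the state probe and of the control probe. A large gap between the two suggests the state is linearly decodable from the activations; on its own it would not show that the network uses that representation, nor that the representation is a non-trivial consequence of the data or task, which is what the paper's additional conditions are meant to check.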
