What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators
Xinyu Zhang

TL;DR
This paper investigates what internal representations world models learn in reinforcement learning, revealing they develop structured, approximately linear, and functionally relevant representations of environment states across different architectures and games.
Contribution
The study applies interpretability techniques (linear and nonlinear probing, causal interventions, and attention analysis) to show that world models develop structured, approximately linear, and functionally used internal representations of environment states.
Findings
Representations of game states are approximately linear and decodable.
Causal interventions confirm representations are functionally used.
Attention analysis shows spatial specialization across attention heads.
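
The probing comparison behind the first finding can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's code: the hidden states `H`, the target `ball_x`, and all helper names are hypothetical, and a random-ReLU-feature ridge regression stands in for the MLP probe. The point is only the logic of the test: if a nonlinear probe barely improves held-out R^2 over a linear one, the variable is approximately linearly encoded.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def add_bias(X):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression; returns weights including the bias."""
    Xb = add_bias(X)
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(X, w):
    return add_bias(X) @ w

# Synthetic stand-in for world-model activations (assumption: 64-dim states).
rng = np.random.default_rng(0)
n, d = 2000, 64
H = rng.normal(size=(n, d))
true_dir = rng.normal(size=d)
# Hypothetical game-state variable, linearly encoded plus small noise.
ball_x = H @ true_dir + 0.1 * rng.normal(size=n)

H_train, H_test = H[:1500], H[1500:]
y_train, y_test = ball_x[:1500], ball_x[1500:]

# Linear probe.
w_lin = fit_ridge(H_train, y_train)
linear_r2 = r2_score(y_test, predict(H_test, w_lin))

# Nonlinear probe: raw features plus fixed random ReLU features.
# This is a cheap stand-in for an MLP probe, not an MLP itself.
W_rf = rng.normal(size=(d, 256)) / np.sqrt(d)

def features(X):
    return np.hstack([X, np.maximum(X @ W_rf, 0.0)])

w_nl = fit_ridge(features(H_train), y_train)
nonlinear_r2 = r2_score(y_test, predict(features(H_test), w_nl))

print(f"linear R^2 = {linear_r2:.4f}, nonlinear R^2 = {nonlinear_r2:.4f}")
```

With real activations, `H` would be replaced by hidden states extracted from the world model and `ball_x` by the ground-truth state variable from the emulator; a large gap between the two scores would instead suggest a nonlinear encoding.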
Abstract
World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques (linear and nonlinear probing, causal interventions, and attention analysis) to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions, which shift hidden states along probe-derived directions, produce correlated changes in model predictions, providing evidence that representations are functionally…
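
The causal intervention described in the abstract, shifting hidden states along a probe-derived direction and checking whether predictions move with the shift, can be illustrated with a toy example. Everything here is an assumption for illustration: the hidden states, the variable `paddle_x`, and the linear `downstream` head are synthetic, and the real models would use their own decoder or next-frame predictor in place of `downstream`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 32

# Synthetic hidden states and a linearly encoded state variable (hypothetical).
H = rng.normal(size=(n, d))
state_dir = rng.normal(size=d)
paddle_x = H @ state_dir

# Hypothetical downstream head that partly relies on the same direction.
readout = state_dir + 0.5 * rng.normal(size=d)

def downstream(h):
    """Toy stand-in for the part of the model that consumes hidden states."""
    return h @ readout

# Fit a linear probe and normalize its weights into a unit direction.
w, *_ = np.linalg.lstsq(H, paddle_x, rcond=None)
probe_dir = w / np.linalg.norm(w)

# Control: a random unit direction unrelated to the probe.
control_dir = rng.normal(size=d)
control_dir /= np.linalg.norm(control_dir)

def mean_shift(direction, alpha):
    """Average change in the downstream output after shifting by alpha."""
    return np.mean(downstream(H + alpha * direction) - downstream(H))

alphas = np.linspace(-3.0, 3.0, 7)
probe_shifts = np.array([mean_shift(probe_dir, a) for a in alphas])

# Correlation between intervention strength and change in the prediction.
corr = np.corrcoef(alphas, probe_shifts)[0, 1]

# Effect size per unit shift, for the probe direction and the control.
effect_probe = mean_shift(probe_dir, 1.0)
effect_control = mean_shift(control_dir, 1.0)

print(f"corr = {corr:.3f}, probe effect = {effect_probe:.3f}, "
      f"control effect = {effect_control:.3f}")
```

Comparing against a random control direction is a design choice made here, not something the abstract states; it helps separate a specific effect of the probe direction from the generic effect of perturbing the hidden state.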
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Embodied and Extended Cognition
