Interpreting Physics in Video World Models
Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel, Shahab Bakhtiari, Blake Richards, Mike Rabbat

TL;DR
This study investigates how large-scale video transformers internally represent physical variables. It identifies a distinct emergence zone at intermediate depth where physical information becomes accessible, and finds that this information is organized in a distributed manner, unlike the explicit, factorized variables of a classical physics engine.
Contribution
The paper provides the first interpretability analysis of physical representations inside large-scale video encoders, identifies the Physics Emergence Zone, and characterizes how physical variables are encoded.
Findings
Physical information becomes accessible at an intermediate-depth layer.
Scalar quantities like speed and acceleration are available early, while direction emerges later.
Physical representations are distributed and not factorized, yet sufficient for predictions.
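To make the idea of layerwise probing concrete, here is a minimal sketch under stated assumptions; it is not the authors' code. It fits one linear (ridge) probe per layer to decode a scalar variable such as speed and reports held-out R² by depth. The activations are synthetic: `make_synthetic_activations`, the `emerge_at` parameter, and the hidden `readout_axis` are hypothetical stand-ins for real encoder features, constructed so that the variable becomes linearly decodable from a chosen layer onward. The choice of ridge regression is also an assumption, since this summary does not specify the probe type.

```python
import numpy as np

def fit_ridge_probe(X, y, lam=1e-2):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def layerwise_probe(activations, target, train_frac=0.8):
    """Fit one linear probe per layer and return the held-out R^2 per layer."""
    n_train = int(train_frac * target.shape[0])
    scores = []
    for X in activations:
        # Standardize features using training statistics only.
        mu = X[:n_train].mean(axis=0)
        sd = X[:n_train].std(axis=0) + 1e-8
        Xs = (X - mu) / sd
        y_mu = target[:n_train].mean()
        w = fit_ridge_probe(Xs[:n_train], target[:n_train] - y_mu)
        pred = Xs[n_train:] @ w + y_mu
        scores.append(r2_score(target[n_train:], pred))
    return scores

def make_synthetic_activations(n=1000, d=32, n_layers=6, emerge_at=3, seed=0):
    """Hypothetical stand-in for encoder features (NOT real model activations).

    Layers before `emerge_at` are pure noise; later layers add the target
    along a single hidden direction, so a linear probe can read it out.
    """
    rng = np.random.default_rng(seed)
    speed = rng.uniform(0.0, 1.0, size=n)
    readout_axis = rng.normal(size=d)
    readout_axis /= np.linalg.norm(readout_axis)
    layers = []
    for layer in range(n_layers):
        noise = rng.normal(size=(n, d))
        scale = 10.0 if layer >= emerge_at else 0.0
        layers.append(noise + scale * np.outer(speed, readout_axis))
    return layers, speed

layers, speed = make_synthetic_activations()
scores = layerwise_probe(layers, speed)
for depth, score in enumerate(scores):
    print(f"layer {depth}: held-out R^2 = {score:.3f}")
```

With real models, `layers` would be replaced by pooled per-layer features extracted from the encoder, and a sharp jump in the per-layer score would mark the depth at which the variable becomes linearly accessible. Decoding a direction rather than a scalar would need a different target encoding (for example, sine and cosine of the angle), which this sketch does not cover.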
Abstract
A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition --…
Taxonomy
Topics: Embodied and Extended Cognition · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
