MetaOthello: A Controlled Study of Multiple World Models in Transformers
Aviral Chawla, Galen Hall, Juniper Lovato

TL;DR
MetaOthello uses a suite of Othello variants to investigate how transformers organize multiple, potentially conflicting world models within a shared representation space, and finds a mix of shared and specialized internal representations.
Contribution
Introduces MetaOthello, a controlled benchmark for studying multiple world models in transformers, and provides insights into their shared and layered organization across variants.
Findings
Transformers trained on multiple variants develop a mostly shared board-state representation rather than isolated sub-models.
Linear probes can transfer causally across variants, indicating shared internal states.
Representations are equivalent up to orthogonal rotations for isomorphic games.
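One common way to check a claim like "equivalent up to an orthogonal rotation" is an orthogonal Procrustes fit between two sets of activations. The sketch below is illustrative only and is not taken from the paper: the activation matrices are synthetic stand-ins, and the names `orthogonal_procrustes_fit`, `X`, `Y`, and `Q` are hypothetical.

```python
import numpy as np

def orthogonal_procrustes_fit(X, Y):
    """Return the orthogonal R minimizing ||X @ R - Y||_F and the relative residual."""
    # The SVD of the cross-covariance X^T Y gives the optimal orthogonal map U @ Vt.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    residual = np.linalg.norm(X @ R - Y) / np.linalg.norm(Y)
    return R, residual

# Synthetic stand-ins (assumption): rows are positions, columns are hidden units.
rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(200, d))                  # "activations" on variant A
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix
Y = X @ Q + 0.01 * rng.normal(size=(200, d))   # variant B = rotated A plus small noise

R, residual = orthogonal_procrustes_fit(X, Y)
print(f"relative residual after alignment: {residual:.4f}")
```

A small residual after alignment is consistent with the two representations differing only by an orthogonal map; a large one suggests the relationship is not a pure rotation or reflection.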
Abstract
Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting "world models". Previous experiments on Othello-playing neural networks test world-model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene…
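As a rough illustration of probe transfer, the sketch below fits a linear probe on synthetic "activations" from one variant and evaluates it on another. Everything here is an assumption made for illustration: the data are random, both variants are constructed to share one hypothetical direction `w_true`, and the sketch covers only read-out accuracy, not the causal interventions the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 2000
# Hypothetical shared direction encoding one square's state (assumption).
w_true = rng.normal(size=d)

def make_variant(n_samples):
    """Synthetic activations and binary labels that depend on the shared direction."""
    acts = rng.normal(size=(n_samples, d))
    labels = (acts @ w_true > 0).astype(float)
    return acts, labels

acts_a, labels_a = make_variant(n)   # stand-in for variant A
acts_b, labels_b = make_variant(n)   # stand-in for variant B

# Least-squares linear probe fit on variant A only, with targets in {-1, +1}.
w_probe, *_ = np.linalg.lstsq(acts_a, 2 * labels_a - 1, rcond=None)

def accuracy(acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    return float(np.mean((acts @ w_probe > 0) == (labels > 0.5)))

acc_a = accuracy(acts_a, labels_a)   # in-variant accuracy
acc_b = accuracy(acts_b, labels_b)   # transfer accuracy on the other variant
print(f"probe accuracy on A: {acc_a:.3f}, transferred to B: {acc_b:.3f}")
```

In this toy setup the transfer works by construction, because both variants use the same encoding direction. With real model activations, a comparable transfer accuracy would be evidence for a shared representation rather than a given.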
Taxonomy
Topics: Artificial Intelligence in Games · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
