Transformers Linearly Represent Highly Structured World Models
Roman Kniazev, Nathanaël Fijalkow

TL;DR
This paper demonstrates that transformers trained on Sudoku solving traces develop structured internal world models aligned with the domain's constraints, and it identifies a small, interpretable naked-single circuit in the final MLP layer.
Contribution
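The study shows that a transformer's internal representation is organized around the rows, columns, and boxes that Sudoku's constraints act on, and it identifies a naked-single circuit: a small set of final-MLP-layer neurons, each detecting when exactly one digit remains possible for a specific cell.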
Findings
Transformers organize information around Sudoku constraints rather than individual cells.
A dedicated neuron circuit detects when only one digit remains possible for a cell.
The internal model's geometry is shaped by the domain's algebraic structure.
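To make the contrast between a cell-by-cell state and a constraint-organized state concrete, the following is a minimal sketch of one possible substructure encoding: for each of the 27 units (9 rows, 9 columns, 9 boxes), a 9-bit vector records which digits are already placed in that unit. The function name `substructure_features` and the unit ordering are assumptions made for illustration; this is not the paper's probing method, nor a claim about the exact features the model learns.

```python
from typing import List

Board = List[List[int]]  # 9x9 grid; 0 marks an empty cell (illustrative convention)


def substructure_features(board: Board) -> List[List[int]]:
    """Return a 27 x 9 binary table (hypothetical encoding).

    feats[u][d - 1] == 1 means digit d is already placed somewhere in unit u.
    Units 0-8 are rows, 9-17 are columns, 18-26 are 3x3 boxes.
    """
    feats = [[0] * 9 for _ in range(27)]
    for r in range(9):
        for c in range(9):
            d = board[r][c]
            if d == 0:
                continue  # empty cells contribute nothing
            box = 3 * (r // 3) + c // 3
            # A placed digit marks its row, its column, and its box.
            for unit in (r, 9 + c, 18 + box):
                feats[unit][d - 1] = 1
    return feats


if __name__ == "__main__":
    board = [[0] * 9 for _ in range(9)]
    board[4][7] = 3
    feats = substructure_features(board)
    # Indices of the units that now contain digit 3.
    print([u for u in range(27) if feats[u][2] == 1])
```

In this encoding a single placed digit touches exactly three units, so the state is described in terms of the constraints rather than in terms of the 81 individual cells.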
Abstract
Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku's constraints act on. Second, we identify a naked-single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the…
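The naked-single rule that the circuit is described as detecting can be sketched symbolically: a cell is a naked single when, after excluding every digit already present in its row, column, and box, exactly one candidate remains. The code below is a plain reference implementation of that Sudoku rule under an assumed board convention (0 for empty); the helper names `candidates` and `naked_single` are hypothetical, and nothing here describes how the transformer's neurons compute it.

```python
from typing import List, Optional, Set

Board = List[List[int]]  # 9x9 grid; 0 marks an empty cell (illustrative convention)


def candidates(board: Board, r: int, c: int) -> Set[int]:
    """Digits still possible at (r, c) given its row, column, and box."""
    if board[r][c] != 0:
        return set()  # a filled cell has no open candidates
    used = set(board[r])                      # digits in the row
    used |= {board[i][c] for i in range(9)}   # digits in the column
    br, bc = 3 * (r // 3), 3 * (c // 3)       # top-left corner of the box
    used |= {board[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used           # the 0 placeholder is ignored here


def naked_single(board: Board, r: int, c: int) -> Optional[int]:
    """Return the forced digit if exactly one candidate remains, else None."""
    cands = candidates(board, r, c)
    return next(iter(cands)) if len(cands) == 1 else None


if __name__ == "__main__":
    board = [[0] * 9 for _ in range(9)]
    board[0] = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # row 0 is missing only the digit 9
    print(naked_single(board, 0, 0))
```

Note that the rule only needs the digit sets of three units, which is the same row, column, and box information the abstract says the model organizes its state around.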
