A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, Christian Bartelt

TL;DR
This paper provides a detailed mechanistic analysis of a transformer trained on a synthetic multi-step reasoning task, revealing interpretable internal processes and mechanisms that facilitate reasoning.
Contribution
The paper presents a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task and identifies a depth-bounded recurrent mechanism that the model uses to solve it.
Findings
The trained transformer appears to implement a depth-bounded recurrent mechanism that operates in parallel.
Intermediate results are stored in selected token positions.
The identified motifs may help in understanding more complex models.
Abstract
Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities are a result of actual reasoning, existing work has focused on developing sophisticated benchmarks for behavioral studies. However, these studies do not provide insights into the internal mechanisms driving the observed capabilities. To improve our understanding of the internal mechanisms of transformers, we present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that it implements a depth-bounded recurrent mechanism that operates in parallel and stores intermediate results in selected token positions. We anticipate that the motifs we identified in…
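To make the phrase "depth-bounded recurrent mechanism" more concrete, here is a toy sketch. It is not the authors' model, task, or code. It assumes a hypothetical task of following a chain of parent links, and the names `bounded_chain`, `parent`, `starts`, and `num_layers` are illustrative only. Each loop iteration stands in for one layer, and each entry of `slots` stands in for a token position that holds intermediate results.

```python
from typing import Dict, List


def bounded_chain(
    parent: Dict[str, str], starts: List[str], num_layers: int
) -> List[List[str]]:
    """Follow parent links from each start node, one step per 'layer'.

    Hypothetical illustration: the number of steps is capped by num_layers,
    so chains longer than that are left unresolved.
    """
    # One slot per start node; each slot keeps the intermediate results so far.
    slots = [[s] for s in starts]
    for _ in range(num_layers):
        # Every slot advances by one step within the same layer
        # (this stands in for the "parallel" part of the mechanism).
        for slot in slots:
            nxt = parent.get(slot[-1])
            if nxt is not None:
                slot.append(nxt)
    return slots


# Assumed toy data: the chain d -> c -> b -> a, where "a" has no parent.
parent = {"d": "c", "c": "b", "b": "a"}
print(bounded_chain(parent, ["d", "b"], num_layers=2))
```

With two layers, the chain starting at `d` stops at `b` without reaching `a`, because three steps would be needed. This is the sense in which the depth is bounded in this sketch.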
Taxonomy
Topics: Evolutionary Algorithms and Applications
Methods: Sparse Evolutionary Training
