Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori, Thomas Serre, Ellie Pavlick

TL;DR
This paper introduces circuit probing, an analysis technique that automatically uncovers the low-level circuits computing hypothesized intermediate variables in a neural network, enabling causal analysis of models on both simple and real-world tasks.
Contribution
The paper presents circuit probing, a new method that automatically identifies circuits for hypothesized variables, extending existing analysis tools for neural network interpretability.
Findings
Effectively deciphered learned algorithms in models trained on arithmetic tasks.
Revealed modular structure within neural networks.
Tracked development of circuits during training.
Abstract
Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in a network's computation in order to understand these algorithms. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2)…
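The idea of targeted ablation at the level of model parameters can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a circuit is represented as a binary mask over a weight matrix, and the mask is written by hand here, whereas circuit probing is described as uncovering the circuit automatically. The names `ablate`, `forward`, `W`, and `circuit_mask` are hypothetical.

```python
def ablate(weights, circuit_mask):
    """Return a copy of `weights` with every masked parameter set to zero."""
    return [
        [0.0 if masked else w for w, masked in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, circuit_mask)
    ]


def forward(weights, x):
    """Apply a bias-free linear layer: one dot product per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy layer with 2 output units and 3 inputs (values chosen for illustration).
W = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
]

# Hypothetical circuit: the nonzero weights feeding output unit 0.
circuit_mask = [
    [True, True, False],
    [False, False, False],
]

x = [1.0, 1.0, 1.0]

original = forward(W, x)
ablated = forward(ablate(W, circuit_mask), x)

print("original:", original)
print("ablated: ", ablated)
```

In this toy case, zeroing the masked weights changes only output unit 0 and leaves unit 1 untouched. Ablation-based causal analysis looks for this kind of selective effect, here on a scale far smaller than a real transformer.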
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
