Uncovering Intermediate Variables in Transformers using Circuit Probing

Michael A. Lepori; Thomas Serre; Ellie Pavlick

arXiv:2311.04354 · cs.CL · February 13, 2025 · 1 citation

TL;DR

This paper introduces circuit probing, an analysis technique that automatically uncovers the low-level circuits computing hypothesized intermediate variables. It supports causal analysis through targeted ablation of model parameters and is applied to both simple and real-world tasks.

Contribution

The paper presents circuit probing, a new method that automatically identifies circuits for hypothesized variables, extending existing analysis tools for neural network interpretability.

Findings

1. Effectively deciphered learned algorithms in models trained on arithmetic tasks.
2. Revealed modular structure within neural networks.
3. Tracked development of circuits during training.

Abstract

Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in a network's computation in order to understand these algorithms. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2)…
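The idea of targeted ablation at the level of model parameters, mentioned in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy weights, the hand-picked `circuit_mask`, and the helper names `linear` and `ablate` are all assumptions made for illustration, and circuit probing itself discovers the circuit automatically rather than taking it as given. The sketch only shows what it means to zero out a candidate circuit and observe which outputs change.

```python
# Illustrative sketch only: toy weights and a hand-picked mask,
# not the authors' code or the circuit probing training procedure.

def linear(weights, x):
    """Apply a bias-free linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def ablate(weights, circuit_mask):
    """Zero every weight that the binary mask marks as part of the circuit."""
    return [
        [0.0 if m else w for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, circuit_mask)
    ]

# Toy layer: output 0 computes a + b (the hypothesized intermediate variable),
# output 1 copies b unchanged.
weights = [[1.0, 1.0],
           [0.0, 1.0]]

# Hypothetical circuit: the weights in row 0 that compute a + b.
circuit_mask = [[1, 1],
                [0, 0]]

x = [2.0, 3.0]
before = linear(weights, x)                        # both outputs intact
after = linear(ablate(weights, circuit_mask), x)   # circuit removed

print("before ablation:", before)
print("after ablation: ", after)
```

If the masked weights really are the circuit for `a + b`, ablating them should destroy that output while leaving the unrelated output untouched, which is the kind of causal evidence the abstract describes.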

Code & Models

Repositories

mlepori1/circuit_probing (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification