InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Rohan Gupta; Iván Arcuschin; Thomas Kwa; Adrià Garriga-Alonso

arXiv:2407.14494·cs.LG·October 14, 2025

TL;DR

InterpBench introduces semi-synthetic transformers with known circuits to rigorously evaluate mechanistic interpretability methods, enabling validation of techniques against models with verified internal algorithms.

Contribution

The paper presents InterpBench, a benchmark of semi-synthetic transformers with known circuits, together with a new training method, Strict IIT (SIIT), used to build them for assessing interpretability techniques.

Findings

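01. SIIT models trained from Tracr's sparse transformers maintain the original Tracr circuit while being more realistic.

02. SIIT can also train transformers with larger circuits, such as Indirect Object Identification.

03. The benchmark makes it possible to validate circuit discovery methods against models whose circuits are known.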

Abstract

Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification…
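The difference between the two checks described in the abstract can be illustrated on a toy model. The sketch below is not the authors' training code: it uses no neural network and no loss, and every name in it (`high_level`, `low_level`, the node labels `circuit`, `extra_1`, `extra_2`, and the `extra_weight` parameter) is hypothetical. It assumes a high-level causal model with a single variable S = a + b, and a low-level model with one node aligned to S plus two non-circuit nodes that cancel on clean inputs. An interchange intervention copies a node's value from a source input into a run on a base input.

```python
from itertools import product

def high_level(a, b, s_override=None):
    """Hypothetical high-level causal model: one variable S = a + b, output = S."""
    s = a + b if s_override is None else s_override
    return s

def low_level_nodes(a, b):
    """Internal node values of the toy low-level model on input (a, b)."""
    return {"circuit": a + b, "extra_1": a, "extra_2": a}

def low_level(a, b, patch=None, extra_weight=1):
    """Run the toy model, overwriting any nodes named in `patch`."""
    nodes = low_level_nodes(a, b)
    nodes.update(patch or {})
    # The two extra nodes cancel on clean inputs but not when one is patched.
    return nodes["circuit"] + extra_weight * (nodes["extra_1"] - nodes["extra_2"])

def passes_iit_check(inputs, extra_weight=1):
    """Patching the aligned node must match patching S in the high-level model."""
    for base, src in product(inputs, repeat=2):
        src_nodes = low_level_nodes(*src)
        low = low_level(*base, patch={"circuit": src_nodes["circuit"]},
                        extra_weight=extra_weight)
        high = high_level(*base, s_override=src[0] + src[1])
        if low != high:
            return False
    return True

def passes_strict_check(inputs, extra_weight=1):
    """Patching any non-circuit node must leave the base output unchanged."""
    for base, src in product(inputs, repeat=2):
        src_nodes = low_level_nodes(*src)
        for name in ("extra_1", "extra_2"):
            low = low_level(*base, patch={name: src_nodes[name]},
                            extra_weight=extra_weight)
            if low != high_level(*base):
                return False
    return True

INPUTS = [(0, 1), (2, 3), (5, -1)]

if __name__ == "__main__":
    # Non-circuit nodes wired into the output (extra_weight=1).
    print(passes_iit_check(INPUTS), passes_strict_check(INPUTS))
    # Non-circuit nodes disconnected from the output (extra_weight=0).
    print(passes_iit_check(INPUTS, 0), passes_strict_check(INPUTS, 0))
```

With `extra_weight=1` the toy model satisfies the interchange-intervention check yet fails the strict one, because its non-circuit nodes still influence the output once their values are swapped. With `extra_weight=0` it satisfies both. This is the kind of leftover dependence that, per the abstract, SIIT is meant to rule out.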

Code & Models

Models

cybershiptrooper/InterpBench (Hugging Face model)

Videos

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques (SlidesLive)
