Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

TL;DR
This paper introduces algorithms that automate circuit discovery in neural networks, one step of the mechanistic interpretability workflow, and validates them by reproducing earlier interpretability results on GPT-2 Small.
Contribution
It systematizes the mechanistic interpretability workflow, proposes several algorithms that automate identifying the circuit behind a specified behavior in a model's computational graph, and validates them by reproducing previous interpretability results on GPT-2 Small.
Findings
ACDC rediscovered 5/5 of the component types in a GPT-2 Small circuit
ACDC selected 68 edges out of 32,000, matching manual findings
Validated automated methods against previous interpretability results
Abstract
Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: to identify the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the…
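The circuit-identification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy additive graph, its weights, the absolute-difference metric, and the threshold name `tau` are all assumptions made for illustration. The sketch walks the graph from output to input and patches each incoming edge with an activation cached from a corrupted input. An edge is dropped when patching it worsens the metric by less than `tau`.

```python
from typing import Dict, Optional, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)

# Hypothetical toy graph: each non-input node is a weighted sum of its parents.
WEIGHTS: Dict[Edge, float] = {
    ("x1", "h1"): 1.0,
    ("x2", "h1"): 0.01,
    ("x2", "h2"): 0.02,
    ("x3", "h2"): 0.01,
    ("h1", "out"): 2.0,
    ("h2", "out"): 0.01,
}
ORDER = ["x1", "x2", "x3", "h1", "h2", "out"]  # topological order


def run(inputs: Dict[str, float], kept: Set[Edge],
        corrupted: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Forward pass. Edges not in `kept` read the parent's value from the
    corrupted cache instead of the current run (activation patching)."""
    values: Dict[str, float] = {}
    for node in ORDER:
        if node in inputs:
            values[node] = inputs[node]
            continue
        total = 0.0
        for (parent, child), weight in WEIGHTS.items():
            if child != node:
                continue
            source = values if (parent, child) in kept else corrupted
            total += weight * source[parent]
        values[node] = total
    return values


def discover_circuit(clean: Dict[str, float], corrupt: Dict[str, float],
                     tau: float) -> Set[Edge]:
    """Greedy edge pruning from output to input (assumed simplification)."""
    all_edges = set(WEIGHTS)
    cache = run(corrupt, all_edges)          # corrupted activations
    target = run(clean, all_edges)["out"]    # full-model output

    def metric(edges: Set[Edge]) -> float:
        # Assumed metric: absolute deviation from the full model's output.
        return abs(run(clean, edges, cache)["out"] - target)

    kept = set(all_edges)
    current = metric(kept)
    for node in reversed(ORDER):
        for edge in sorted(e for e in kept if e[1] == node):
            trial = kept - {edge}
            candidate = metric(trial)
            if candidate - current < tau:    # edge barely matters: prune it
                kept, current = trial, candidate
    return kept


if __name__ == "__main__":
    clean = {"x1": 1.0, "x2": 1.0, "x3": 1.0}
    corrupt = {"x1": 0.0, "x2": 0.0, "x3": 0.0}
    print(sorted(discover_circuit(clean, corrupt, tau=0.1)))
    # prints [('h1', 'out'), ('x1', 'h1')]
```

In this toy setting only the two high-weight edges survive. Raising `tau` prunes more aggressively, and lowering it keeps more edges. A real model would replace the additive nodes with attention heads and MLPs, and would use a task-appropriate metric on the model's outputs.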
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Fault Detection and Control Systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Weight Decay · Residual Connection · Dense Connections · Dropout
