Towards Automated Circuit Discovery for Mechanistic Interpretability

Arthur Conmy; Augustine N. Mavor-Parker; Aengus Lynch; Stefan Heimersheim; Adrià Garriga-Alonso

arXiv:2304.14997 · cs.LG · October 31, 2023 · 31 cites

TL;DR

This paper introduces algorithms that automate circuit discovery in neural networks, one step of the mechanistic interpretability workflow, and validates them by reproducing earlier interpretability results, including experiments on GPT-2 Small.

Contribution

It systematizes the mechanistic interpretability process, proposes several algorithms (including ACDC) that automate the step of identifying the circuit behind a specified behavior, and validates them by reproducing previous interpretability results on GPT-2 Small.

Findings

01

ACDC rediscovered 5/5 of the component types in a GPT-2 Small circuit

02

ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which previous work had found manually

03

Validated automated methods against previous interpretability results

Abstract

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: to identify the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the…
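The patch-and-prune idea in the abstract can be illustrated on a toy graph. The sketch below is not the authors' implementation: the four-edge graph, its weights, the inputs, the function names `run` and `discover_circuit`, and the threshold `tau` are all assumptions made for illustration, and the metric is a plain absolute difference in the output rather than the metric used in the paper. It visits edges from the output backwards, patches each edge with its activation from a corrupted input, and removes the edge when the output barely changes.

```python
# Toy computational graph; all weights and inputs are invented for illustration.
WEIGHTS = {
    ("x", "h1"): 1.0,
    ("x", "h2"): 0.5,
    ("h1", "out"): 2.0,
    ("h2", "out"): 0.01,
}
ORDER = ["x", "h1", "h2", "out"]  # topological order of the nodes


def run(x_clean, x_corrupt, patched_edges):
    """Forward pass in which every edge in patched_edges carries the
    sender's activation from the corrupted run instead of the clean run."""
    mixed = {"x": x_clean}
    corrupt = {"x": x_corrupt}
    for node in ORDER[1:]:
        incoming = [edge for edge in WEIGHTS if edge[1] == node]
        corrupt[node] = sum(WEIGHTS[e] * corrupt[e[0]] for e in incoming)
        mixed[node] = sum(
            WEIGHTS[e] * (corrupt[e[0]] if e in patched_edges else mixed[e[0]])
            for e in incoming
        )
    return mixed["out"]


def discover_circuit(x_clean, x_corrupt, tau):
    """Greedy pruning: remove an edge when patching it changes the
    output by less than tau (a simplified stand-in for the paper's metric)."""
    removed = set()
    # Visit edges whose destination is closest to the output first.
    edges = sorted(WEIGHTS, key=lambda e: ORDER.index(e[1]), reverse=True)
    for edge in edges:
        current = run(x_clean, x_corrupt, removed)
        trial = run(x_clean, x_corrupt, removed | {edge})
        if abs(trial - current) < tau:
            removed.add(edge)
    return set(WEIGHTS) - removed


print(sorted(discover_circuit(1.0, 0.0, tau=0.1)))
# prints [('h1', 'out'), ('x', 'h1')]
```

With this threshold the low-weight path through `h2` is pruned and only the `x` to `h1` to `out` path remains; a much smaller `tau` keeps all four edges, which shows how the threshold trades circuit size against faithfulness in this toy setting.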

Code & Models

Datasets

agaralon/ACDC-Runs
dataset · 94 dl

Videos

Towards Automated Circuit Discovery for Mechanistic Interpretability · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Fault Detection and Control Systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Weight Decay · Residual Connection · Dense Connections · Dropout