Attribution Patching Outperforms Automated Circuit Discovery

Aaquib Syed; Can Rager; Arthur Conmy

arXiv:2310.10348 · cs.LG · November 21, 2023 · 1 citation

TL;DR

This paper shows that a simple method based on attribution patching outperforms existing automated circuit discovery techniques. It needs only two forward passes and one backward pass to estimate the importance of each edge, and the least important edges are then pruned.

Contribution

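The authors apply a linear approximation to activation patching (attribution patching) to estimate the importance of each edge in the computational subgraph. They report that, averaged over all tasks, this gives greater circuit-recovery AUC than prior automated circuit discovery methods while requiring just two forward passes and a backward pass.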
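The linear approximation can be written as a first-order Taylor expansion. The notation below is an assumption made for illustration and is not taken from this page: $L$ is the task metric, $x_{\text{clean}}$ the clean input, and $e_{\text{clean}}$, $e_{\text{corr}}$ the activations of one edge on the clean and corrupted inputs.

```latex
% Effect of patching one edge, approximated to first order around the clean run
L\big(x_{\text{clean}} \mid \mathrm{do}(e = e_{\text{corr}})\big) - L(x_{\text{clean}})
  \;\approx\;
  (e_{\text{corr}} - e_{\text{clean}})^{\top}
  \left.\frac{\partial L}{\partial e}\right|_{e = e_{\text{clean}}}
```

The left-hand side is what activation patching measures with one extra forward pass per edge. The right-hand side needs the clean and corrupted activations (two forward passes) and the gradient on the clean run (one backward pass), which is consistent with the cost stated above.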

Findings

01

Averaged over all tasks, the method achieves a higher circuit-recovery AUC than other methods.

02

It requires only two forward passes and one backward pass.

03

The resulting importance estimates are used to prune the least important edges of the network.

Abstract

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
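As a rough illustration of the procedure the abstract describes, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy network, its weights, the inputs, and every function name are hypothetical, and it scores hidden nodes rather than edges to stay short. The gradient is written out by hand for this toy model; a real setup would use an autodiff library.

```python
import math

# Hypothetical toy model: three hidden nodes feeding one tanh output.
W = [[0.9, -0.4], [0.1, 0.05], [-0.6, 0.8]]  # input -> hidden weights (made up)
V = [1.2, 0.3, -0.7]                          # hidden -> output weights (made up)


def hidden(x):
    """Forward pass up to the hidden activations."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def metric(h):
    """Scalar metric computed from the hidden activations."""
    return math.tanh(sum(v * hi for v, hi in zip(V, h)))


def attribution_scores(x_clean, x_corr):
    """Linear estimate of each node's patching effect."""
    h_clean = hidden(x_clean)  # forward pass 1 (clean input)
    h_corr = hidden(x_corr)    # forward pass 2 (corrupted input)
    out = metric(h_clean)
    # Backward pass on the clean run: d metric / d h_i = (1 - out^2) * V[i]
    grads = [(1.0 - out * out) * v for v in V]
    return [(hc - hk) * g for hc, hk, g in zip(h_corr, h_clean, grads)]


def activation_patching_scores(x_clean, x_corr):
    """Exact patching effect, costing one extra metric evaluation per node."""
    h_clean, h_corr = hidden(x_clean), hidden(x_corr)
    base = metric(h_clean)
    scores = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_corr[i]  # swap in the corrupted activation for node i
        scores.append(metric(patched) - base)
    return scores


def prune(scores, keep):
    """Indices of the `keep` nodes with the largest absolute score."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(ranked[:keep])


x_clean, x_corr = [1.0, 0.5], [-0.5, 1.0]  # made-up clean/corrupted inputs
approx = attribution_scores(x_clean, x_corr)
exact = activation_patching_scores(x_clean, x_corr)
kept = prune(approx, keep=2)
print("approximate:", approx)
print("exact:      ", exact)
print("kept nodes: ", kept)
```

With these made-up weights the approximate and exact scores rank the nodes in the same order, so both keep nodes 0 and 2, even though the magnitudes differ where the output nonlinearity is far from linear. That gap is the kind of limitation a linear approximation can be expected to have.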

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning in Materials Science