Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Y. Odisho, Peter R. Carroll, Bin Yu

TL;DR
This paper introduces CD-T, a novel method for efficiently discovering interpretable circuits in large language models, capable of fine-grained analysis with significantly reduced runtime and improved accuracy over previous approaches.
Contribution
The paper presents CD-T, a new mathematical approach for automated circuit discovery in transformers that is faster, more precise, and capable of finer-grained analysis than existing methods.
Findings
CD-T reduces circuit discovery runtime from hours to seconds.
CD-T outperforms ACDC and EAP in circuit recovery accuracy with 97% ROC AUC.
CD-T circuits are 80% more faithful than random circuits and can replicate model behavior with fewer nodes.
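As a rough illustration of what a ROC AUC figure for circuit recovery measures, the sketch below scores a ranking of model components against a hand-labelled circuit. It uses the standard pairwise (Mann-Whitney) definition of ROC AUC. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration only; they are not taken from the paper or its evaluation code.

```python
def roc_auc(labels, scores):
    """ROC AUC via the pairwise (Mann-Whitney) formulation.

    labels: 1 if a component belongs to the reference circuit, else 0.
    scores: importance assigned to each component by a discovery method.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six attention heads, three of them in the manual circuit.
in_manual_circuit = [1, 1, 0, 0, 1, 0]
importance = [0.9, 0.7, 0.2, 0.8, 0.6, 0.1]

auc = roc_auc(in_manual_circuit, importance)
print(f"ROC AUC: {auc:.3f}")
```

A value of 1.0 would mean every component of the reference circuit is ranked above every component outside it; 0.5 corresponds to a random ranking.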
Abstract
Automated mechanistic interpretation research has attracted great interest due to its potential to scale explanations of neural network internals to large models. Existing automated circuit discovery work relies on activation patching or its approximations to identify subgraphs in models for specific tasks (circuits). These methods often suffer from slow runtime, approximation errors, and specific requirements of metrics, such as non-zero gradients. In this work, we introduce contextual decomposition for transformers (CD-T) to build interpretable circuits in large language models. CD-T can produce circuits at an arbitrary level of abstraction, and is the first able to produce circuits as fine-grained as attention heads at specific sequence positions efficiently. CD-T consists of a set of mathematical equations to isolate the contribution of model features. Through recursively computing contribution of…
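To make the idea of isolating a feature's contribution more concrete, here is a minimal sketch of the general contextual decomposition principle applied to a single linear layer. The input is split into a "relevant" part (the feature of interest) and an "irrelevant" part (everything else), and both are propagated so that their sum always equals the ordinary forward pass. This is not the CD-T equations from the paper, which also cover attention and other transformer components. The helper names `matvec` and `cd_linear`, the toy weights, and the choice to assign the whole bias to the irrelevant part are all assumptions made for illustration.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cd_linear(W, b, rel, irrel):
    """Propagate a (relevant, irrelevant) split through y = W x + b.

    Assumption: the bias is attributed entirely to the irrelevant part.
    Other conventions split the bias between the two parts.
    """
    rel_out = matvec(W, rel)
    irrel_out = [v + bi for v, bi in zip(matvec(W, irrel), b)]
    return rel_out, irrel_out


# Hypothetical toy layer and input.
W = [[1.0, 2.0], [0.0, -1.0]]
b = [0.5, 0.5]
x = [3.0, 4.0]

# Treat the first input coordinate as the feature of interest.
rel = [3.0, 0.0]
irrel = [0.0, 4.0]

rel_out, irrel_out = cd_linear(W, b, rel, irrel)
full = [v + bi for v, bi in zip(matvec(W, x), b)]

# The two parts always add back up to the ordinary layer output.
recombined = [r + i for r, i in zip(rel_out, irrel_out)]
print(rel_out, irrel_out, full)
```

Because the decomposition is additive, the same split can be carried through successive layers, which is the sense in which a contribution can be computed recursively.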
Taxonomy
Topics: Image Processing and 3D Reconstruction
Methods: Softmax · Attention Is All You Need · Activation Patching · Sparse Evolutionary Training · Shapley Additive Explanations · Local Interpretable Model-Agnostic Explanations
