Finding Interpretable Prompt-Specific Circuits in Language Models
Gabriel Franco, Lucas M. Tassis, Azalea Rohr, Mark Crovella

TL;DR
This paper introduces ACC++, an improved circuit-tracing method that identifies interpretable, prompt-specific circuits in language models, giving insight into model behavior and into how signal communication differs across languages.
Contribution
ACC++ improves circuit tracing by extracting the signals that are causal for attention from a single forward pass, enabling interpretation and cross-lingual analysis of attention mechanisms in language models.
Findings
Many ACC++ signals are interpretable with natural language descriptions.
Prompt-specific circuits form well-defined clusters with distinct mechanisms.
Cross-language circuits reflect linguistic relatedness.
Abstract
Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we introduce ACC++, an improved circuit-tracing method based on the principle of attention-causal communication (ACC) [1], which identifies signals, i.e., contents of low-dimensional subspaces that cause attention on a token pair. ACC++ extracts circuits from a single forward pass, without replacement models or patching. Circuits identified by ACC++ consist of components that are causal for the model's attention decisions, together with the low-dimensional signals used to communicate between them. Here, we first detail the conceptual advances that ACC++ makes over previous work. We then show that across multiple models, a substantial portion…
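To make the idea of a low-dimensional signal behind an attention decision more concrete, here is a minimal sketch. It is not the ACC++ procedure, whose details are not given in this summary. It assumes a single attention head with random weights, ignores biases, layer normalization, positional encodings, and the softmax scaling factor, and uses an SVD of the query-key bilinear form as one possible way to split the pre-softmax score for a token pair into rank-1 terms. All names (`W_Q`, `W_K`, `x_dst`, `x_src`, `omega`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

# Hypothetical random weights for one attention head (not from a real model).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

# Hypothetical residual-stream vectors at the destination (query) token
# and the source (key) token of one token pair.
x_dst = rng.normal(size=d_model)
x_src = rng.normal(size=d_model)

# Pre-softmax attention score written as a bilinear form:
# score = x_dst^T (W_Q W_K^T) x_src   (scaling by sqrt(d_head) omitted).
omega = W_Q @ W_K.T
score = x_dst @ omega @ x_src

# The SVD splits the bilinear form into rank-1 terms, so the score becomes
# a sum of per-direction contributions s_k * (u_k . x_dst) * (v_k . x_src).
U, S, Vt = np.linalg.svd(omega)
contributions = (x_dst @ U) * S * (Vt @ x_src)

# omega has rank at most d_head, so only the first d_head terms can be nonzero.
top = np.argsort(-np.abs(contributions))[:2]
print("score:", score)
print("largest contributing directions:", top, contributions[top])
```

In this toy setting, the few directions with the largest contributions play the role of a low-dimensional subspace that accounts for most of the score on the token pair. Only a single forward computation of the score is needed, with no patching.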
