ADAG: Automatically Describing Attribution Graphs
Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

TL;DR
ADAG is an automated pipeline for describing attribution graphs in language model interpretability. It replaces manual analysis with feature clustering and LLM-generated explanations that identify groups of features and their functional roles.
Contribution
ADAG introduces attribution profiles, a feature-clustering algorithm, and an LLM-based explanation system for automated circuit tracing in language models.
Findings
Recovered interpretable circuits in known tasks
Identified steerable clusters responsible for harmful jailbreaks
Demonstrated effectiveness on Llama 3.1 8B Instruct
Abstract
In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores…
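The abstract describes attribution profiles as quantifying a feature's functional role through its input and output gradient effects, with features then grouped by clustering. The toy sketch below illustrates that general idea only. The profile layout (normalized input effects concatenated with normalized output effects), the cosine similarity measure, the greedy threshold clustering, the toy numbers, and every function name are assumptions made for illustration; the paper's own clustering algorithm is described as novel and is not reproduced here.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def attribution_profile(input_effects, output_effects):
    """Hypothetical profile: normalized input effects followed by normalized output effects."""
    return normalize(input_effects) + normalize(output_effects)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def greedy_cluster(profiles, threshold=0.8):
    """Stand-in clustering (not the paper's): join the first cluster whose
    seed profile is at least `threshold` similar, else start a new cluster."""
    seeds, labels = [], []
    for profile in profiles:
        for idx, seed in enumerate(seeds):
            if cosine(profile, seed) >= threshold:
                labels.append(idx)
                break
        else:
            seeds.append(profile)
            labels.append(len(seeds) - 1)
    return labels


# Toy data (made up): each feature has an input-effect and an output-effect vector.
features = {
    "f0": ([1.0, 0.0, 0.0], [0.9, 0.1]),
    "f1": ([0.9, 0.1, 0.0], [1.0, 0.0]),
    "f2": ([0.0, 0.0, 1.0], [0.0, 1.0]),
}
names = list(features)
profiles = [attribution_profile(*features[name]) for name in names]
labels = greedy_cluster(profiles)
clusters = dict(zip(names, labels))
print(clusters)
```

In this toy setup, features that are driven by similar inputs and push toward similar outputs end up with nearly parallel profiles and fall into the same cluster, which is the intuition behind grouping features by functional role rather than by the examples they activate on.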
