ADAG: Automatically Describing Attribution Graphs

Aryaman Arora; Zhengxuan Wu; Jacob Steinhardt; Sarah Schwettmann

arXiv:2604.07615·cs.CL·April 10, 2026


TL;DR

ADAG is an automated pipeline for describing the attribution graphs produced by circuit tracing in language models. It replaces manual inspection of each feature with feature clustering and LLM-generated natural-language explanations of each cluster's functional role.

Contribution

The paper introduces attribution profiles, a feature-clustering algorithm, and an LLM explainer-simulator setup for automatically describing the attribution graphs produced by circuit tracing.

Findings

01. Recovered interpretable circuits in known tasks

02. Identified steerable clusters responsible for harmful jailbreaks

03. Demonstrated effectiveness on Llama 3.1 8B Instruct

Abstract

In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores…
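
The abstract describes attribution profiles only at a high level, so the following is a rough illustrative sketch rather than the paper's method. It assumes a dense attribution matrix `attr` in which `attr[i, j]` is the gradient-based effect of node `i` on node `j`, builds each feature's profile by concatenating its normalized incoming and outgoing effects, and groups profiles with a simple greedy cosine-similarity rule. The function names, the normalization, the `threshold` parameter, and the greedy grouping are all assumptions made for illustration; the paper's own clustering algorithm is described as novel and is not reproduced here.

```python
import numpy as np


def _unit(v):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def attribution_profiles(attr, feature_idx):
    """Build one profile per feature (hypothetical construction).

    attr[i, j] is assumed to hold the attribution of node i on node j.
    A profile is the normalized incoming column concatenated with the
    normalized outgoing row.
    """
    attr = np.asarray(attr, dtype=float)
    profiles = []
    for f in feature_idx:
        incoming = attr[:, f]  # effects of upstream nodes on feature f
        outgoing = attr[f, :]  # effects of feature f on downstream nodes
        profiles.append(np.concatenate([_unit(incoming), _unit(outgoing)]))
    return np.stack(profiles)


def greedy_cluster(profiles, threshold=0.8):
    """Assign each profile to the first-seen 'leader' it resembles.

    A stand-in for a real clustering algorithm: a profile joins the most
    similar existing leader if cosine similarity >= threshold, otherwise
    it becomes the leader of a new cluster.
    """
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / np.where(norms > 0, norms, 1.0)
    labels = -np.ones(len(unit), dtype=int)
    leaders = []
    for i, p in enumerate(unit):
        sims = [float(p @ c) for c in leaders]
        if sims and max(sims) >= threshold:
            labels[i] = int(np.argmax(sims))
        else:
            leaders.append(p)
            labels[i] = len(leaders) - 1
    return labels


# Toy graph: nodes 0-1 are inputs, 2-4 are features, 5 is the output.
toy = np.zeros((6, 6))
toy[0, 2], toy[0, 3], toy[1, 4] = 1.0, 0.9, 1.0     # input -> feature
toy[2, 5], toy[3, 5], toy[4, 5] = 0.5, 0.6, -0.7    # feature -> output

toy_profiles = attribution_profiles(toy, feature_idx=[2, 3, 4])
toy_labels = greedy_cluster(toy_profiles)
print(toy_labels.tolist())  # [0, 0, 1]
```

In the toy graph, features 2 and 3 read from the same input and push the output in the same direction, so they share a cluster, while feature 4 reads from a different input and pushes the output the opposite way. Normalizing the incoming and outgoing halves separately is a design choice in this sketch so that neither side dominates the similarity score.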
