Automated Circuit Interpretation via Probe Prompting
Giuseppe Birardi

TL;DR
This paper introduces probe prompting, an automated method for interpreting neural network circuits by transforming attribution graphs into compact, concept-aligned subgraphs that enhance interpretability and reveal hierarchical feature transfer.
Contribution
The authors develop probe prompting, a novel automated pipeline that simplifies circuit interpretation by generating concept-aligned subgraphs with higher behavioral coherence and hierarchical insights.
Findings
Probe prompting preserves high explanatory coverage with reduced complexity.
Concept-aligned groups show significantly higher behavioral coherence than geometric clustering.
Layerwise tests reveal that early-layer features transfer robustly, while late-layer features specialize for output promotion.
Abstract
Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis -- a single prompt can take approximately 2 hours for an experienced circuit tracer. We present probe prompting, an automated pipeline that transforms attribution graphs into compact, interpretable subgraphs built from concept-aligned supernodes. Starting from a seed prompt and target logit, we select high-influence features, generate concept-targeted yet context-varying probes, and group features by cross-prompt activation signatures into Semantic, Relationship, and Say-X categories using transparent decision rules. Across five prompts including classic "capitals" circuits, probe-prompted subgraphs preserve high explanatory coverage while compressing…
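As a rough illustration of two steps named in the abstract (selecting high-influence features, then grouping them by cross-prompt activation signatures), the sketch below uses toy stand-ins. Everything in it is an assumption made for illustration: the function names, the `coverage` and `margin` parameters, the input layouts, and the grouping rule itself. It does not reproduce the paper's decision rules for the Semantic, Relationship, and Say-X categories, which are not given in the text above.

```python
from collections import defaultdict


def select_top_features(influence, coverage=0.8):
    """Keep the highest-influence features until their cumulative share
    of total influence reaches `coverage` (hypothetical selection rule)."""
    total = sum(influence.values())
    kept, running = [], 0.0
    ranked = sorted(influence.items(), key=lambda kv: kv[1], reverse=True)
    for name, score in ranked:
        if running >= coverage * total:
            break
        kept.append(name)
        running += score
    return kept


def group_by_signature(activations, margin=0.2):
    """Assign each feature to the probe concept with the highest mean
    activation, provided it beats the runner-up by at least `margin`.

    activations: {feature: {concept: [activation on each probe]}}
    Features without a clear winner go to an "unassigned" group.
    This is a toy rule, not the paper's classification scheme.
    """
    groups = defaultdict(list)
    for feature, by_concept in activations.items():
        means = {c: sum(v) / len(v) for c, v in by_concept.items()}
        ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
        best, best_mean = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        label = best if best_mean - runner_up >= margin else "unassigned"
        groups[label].append(feature)
    return dict(groups)


if __name__ == "__main__":
    # Made-up influence scores and probe activations, for illustration only.
    influence = {"f1": 5, "f2": 3, "f3": 1, "f4": 1}
    activations = {
        "f1": {"capital": [1.0, 1.0], "state": [0.0, 0.0]},
        "f2": {"capital": [0.5, 0.5], "state": [0.5, 0.5]},
    }
    print(select_top_features(influence, coverage=0.5))
    print(group_by_signature(activations))
```

The point of the sketch is only that a feature's label comes from how its activation varies across probes that target different concepts, rather than from its position in an embedding space. A real pipeline would presumably read activations from the attribution graph's features and apply richer rules than a single margin.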
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks · Multimodal Machine Learning Applications
