Patch-Effect Graph Kernels for LLM Interpretability
Ruben Fernandez-Boullon, David N. Olivieri

TL;DR
This paper introduces a graph-based framework for transformer interpretability that uses patch-effect graphs and graph kernels to compare activation-patching data across prompts and tasks.
Contribution
It reframes mechanistic interpretability as a graph machine-learning problem, providing new methods for constructing and analyzing patch-effect graphs.
Findings
Patch-effect graphs preserve discriminative structural signals.
Localized edge features outperform global graph descriptors in classification.
The co-influence (CI) and partial-correlation (PC) methods identify influential edges with stronger activation effects.
Abstract
Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods (direct-influence via causal mediation, partial-correlation, and co-influence) and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot…
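To make the partial-correlation construction concrete, the following is a minimal sketch, not the paper's implementation. It assumes patch effects are stored as a matrix with one row per prompt and one column per model component. The function names, the `threshold` and `ridge` parameters, and the cosine-style edge kernel are all hypothetical choices made for illustration. The kernel is a simple stand-in and may differ from the graph kernels the authors use.

```python
import numpy as np


def partial_correlation_graph(effects, threshold=0.2, ridge=1e-3):
    """Build a weighted graph from patch effects (hypothetical helper).

    effects: array of shape (n_prompts, n_components).
    Returns a symmetric adjacency matrix whose entries are partial
    correlations, zeroed where the absolute value is below `threshold`.
    """
    centered = effects - effects.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (centered.shape[0] - 1)
    # A small ridge term keeps the covariance invertible (an assumption).
    cov = cov + ridge * np.eye(cov.shape[0])
    precision = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(precision))
    # Partial correlation: rho_ij = -P_ij / sqrt(P_ii * P_jj)
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 0.0)
    return np.where(np.abs(pcorr) >= threshold, pcorr, 0.0)


def edge_slot_kernel(adj_a, adj_b):
    """Cosine similarity over aligned edge weights (illustrative stand-in)."""
    upper = np.triu_indices_from(adj_a, k=1)
    a, b = adj_a[upper], adj_b[upper]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Synthetic data only: component 1 is made to track component 0.
rng = np.random.default_rng(0)
effects = rng.normal(size=(50, 6))
effects[:, 1] = effects[:, 0] + 0.1 * rng.normal(size=50)

adj = partial_correlation_graph(effects)
self_similarity = edge_slot_kernel(adj, adj)
```

Because every graph shares the same node set (the model's components), edges can be compared slot by slot, which is what the stand-in kernel relies on.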
