Patch-Effect Graph Kernels for LLM Interpretability

Ruben Fernandez-Boullon; David N. Olivieri

arXiv:2605.06480 · cs.AI · May 8, 2026

TL;DR

This paper introduces a graph-based framework for mechanistic interpretability of transformers: it represents activation-patching data as patch-effect graphs and uses graph kernels to compare those graphs across prompts and tasks.

Contribution

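The paper reframes mechanistic interpretability as a graph machine-learning problem, contributing three patch-effect graph constructions (direct-influence, partial-correlation, and co-influence) and a graph-kernel analysis of the resulting structures.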

Findings

01. Patch-effect graphs preserve discriminative structural signals.

02. Localized edge features outperform global graph descriptors in classification.

03. The co-influence (CI) and partial-correlation (PC) constructions identify influential edges with stronger activation effects.

Abstract

Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods (direct-influence via causal mediation, partial-correlation, and co-influence) and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot…
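
To make the partial-correlation construction and the graph comparison step more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the input layout (one row per prompt, one column per model component), the function names, the ridge term, the edge threshold, and the cosine-style kernel over edge slots are all hypothetical choices made for this example.

```python
import numpy as np


def partial_correlation_graph(effects, ridge=1e-3, threshold=0.1):
    """Build a weighted adjacency matrix from patch effects.

    effects: (n_prompts, n_components) array of patch-effect scores.
    This layout, the ridge term, and the threshold are assumptions.
    """
    effects = np.asarray(effects, dtype=float)
    n_components = effects.shape[1]
    # Regularized covariance between components, then its inverse (precision).
    cov = np.cov(effects, rowvar=False) + ridge * np.eye(n_components)
    precision = np.linalg.inv(cov)
    # Standard identity: pcorr_ij = -P_ij / sqrt(P_ii * P_jj).
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    pcorr = (pcorr + pcorr.T) / 2.0  # remove floating-point asymmetry
    np.fill_diagonal(pcorr, 0.0)
    # Keep only edges whose absolute partial correlation passes the threshold.
    return np.where(np.abs(pcorr) >= threshold, pcorr, 0.0)


def edge_slot_features(adj):
    """Flatten the upper triangle so each component pair is one feature slot."""
    rows, cols = np.triu_indices(adj.shape[0], k=1)
    return adj[rows, cols]


def edge_slot_kernel(adj_a, adj_b):
    """Cosine similarity between edge-slot vectors (a simple illustrative kernel)."""
    fa, fb = edge_slot_features(adj_a), edge_slot_features(adj_b)
    denom = np.linalg.norm(fa) * np.linalg.norm(fb)
    return float(fa @ fb / denom) if denom > 0 else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for patch effects on two task families.
    task_a = rng.normal(size=(200, 4))
    task_a[:, 1] = task_a[:, 0] + 0.1 * rng.normal(size=200)
    task_b = rng.normal(size=(200, 4))
    graph_a = partial_correlation_graph(task_a)
    graph_b = partial_correlation_graph(task_b)
    print(edge_slot_kernel(graph_a, graph_a), edge_slot_kernel(graph_a, graph_b))
```

Because both graphs are defined over the same component set, each component pair occupies a fixed slot in the feature vector, which is what lets a simple vector kernel compare graphs from different prompts or tasks. The synthetic data here only exercises the code and says nothing about results on GPT-2 Small.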
