# RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching

**Authors:** Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda

arXiv: 2508.21258 · 2025-10-31

## TL;DR

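RelP (Relevance Patching) is a circuit discovery method for language models that replaces the local gradients in attribution patching with Layer-wise Relevance Propagation (LRP) coefficients. It approximates activation patching more faithfully than attribution patching at the same cost (two forward passes and one backward pass), and is validated across multiple models and tasks.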

## Contribution

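RelP replaces the local gradients used in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). This improves faithfulness to activation patching while keeping the cost of attribution patching: two forward passes and one backward pass.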

## Key findings

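- RelP approximates activation patching more accurately than standard attribution patching, particularly for residual stream and MLP outputs in the Indirect Object Identification (IOI) task.
- For MLP outputs in GPT-2 Large, RelP reaches a Pearson correlation of 0.956 with activation patching, whereas attribution patching achieves 0.006.
- For sparse feature circuits, RelP achieves faithfulness comparable to Integrated Gradients (IG) without IG's extra computational cost.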
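
The correlation figures above measure how closely an approximation tracks activation patching across components. A minimal sketch of that comparison follows, assuming the per-component effects have already been collected into arrays. The function name `pearson` and the example arrays are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two sets of per-component effects."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0.0:
        # Correlation is undefined when either input is constant.
        return float("nan")
    return float((xc * yc).sum() / denom)

# Placeholder numbers for illustration only (NOT results from the paper):
# one entry per patched component, e.g. one per MLP layer.
activation_patching_effects = [0.10, -0.40, 0.90, 0.05]
approximated_effects = [0.12, -0.35, 0.80, 0.00]

print(pearson(activation_patching_effects, approximated_effects))
```

A value near 1 means the approximation ranks and scales components much like activation patching does, while a value near 0 means it carries almost no information about the true patching effects.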

## Abstract

Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network's output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
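
The relationship between the three methods named in the abstract can be illustrated on a toy example. The following is a minimal sketch, not the paper's implementation: the element-wise one-layer "model" and the names `metric`, `gradient_coeff`, and `relevance_style_coeff` are hypothetical, and the relevance-style coefficient uses one simple LRP-style choice (treating GELU as a linear scaling `gelu(z) / z` instead of differentiating through it), assumed here purely for illustration. The propagation rules RelP actually uses are given in the full paper.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def gelu_grad(z, eps=1e-5):
    # Central finite difference; stands in for autograd in this sketch.
    return (gelu(z + eps) - gelu(z - eps)) / (2.0 * eps)

def metric(h, w_in, w_out):
    # Hypothetical toy model downstream of activation h: m = sum_i w_out[i] * gelu(w_in[i] * h[i])
    return float(w_out @ gelu(w_in * h))

def activation_patching(h_clean, h_corrupt, w_in, w_out):
    # Exact effect of patching each component: one extra forward pass per component.
    base = metric(h_clean, w_in, w_out)
    effects = np.empty_like(h_clean, dtype=float)
    for i in range(h_clean.size):
        h = h_clean.copy()
        h[i] = h_corrupt[i]
        effects[i] = metric(h, w_in, w_out) - base
    return effects

def gradient_coeff(h_clean, w_in, w_out):
    # Attribution patching: local gradient of the metric w.r.t. h at the clean input.
    z = w_in * h_clean
    return w_out * gelu_grad(z) * w_in

def relevance_style_coeff(h_clean, w_in, w_out):
    # ASSUMPTION (illustrative rule): replace the GELU derivative with gelu(z) / z.
    z = w_in * h_clean
    nonzero = np.abs(z) > 1e-8
    safe_z = np.where(nonzero, z, 1.0)
    ratio = np.where(nonzero, gelu(z) / safe_z, 0.5)  # limit of gelu(z)/z as z -> 0
    return w_out * ratio * w_in

def linear_estimate(h_clean, h_corrupt, coeff):
    # Both approximations share this form: (corrupt - clean) * coefficient.
    return (h_corrupt - h_clean) * coeff

rng = np.random.default_rng(0)
h_clean, h_corrupt = rng.normal(size=4), rng.normal(size=4)
w_in, w_out = rng.normal(size=4), rng.normal(size=4)

print("activation patching :", activation_patching(h_clean, h_corrupt, w_in, w_out))
print("attribution patching:", linear_estimate(h_clean, h_corrupt, gradient_coeff(h_clean, w_in, w_out)))
print("relevance-style     :", linear_estimate(h_clean, h_corrupt, relevance_style_coeff(h_clean, w_in, w_out)))
```

The two estimates differ only in the coefficient that multiplies the activation difference, which is why swapping gradients for propagation coefficients leaves the pass count unchanged. This toy says nothing about which estimate is closer in real models; that is what the paper's experiments evaluate.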

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21258/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21258/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/2508.21258/full.md

---
Source: https://tomesphere.com/paper/2508.21258