RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching

Farnoush Rezaei Jafari; Oliver Eberle; Ashkan Khakzar; Neel Nanda

arXiv:2508.21258·cs.LG·October 31, 2025


TL;DR

RelP introduces a relevance propagation-based method for circuit discovery in language models, offering a more faithful and computationally efficient alternative to traditional attribution patching, validated across multiple models and tasks.

Contribution

RelP replaces local gradients with Layer-wise Relevance Propagation coefficients, improving faithfulness and efficiency in circuit discovery within language models.

Findings

1. RelP approximates activation patching more accurately than attribution patching, particularly on residual stream and MLP outputs.

2. On GPT-2 Large MLP outputs, RelP reaches a Pearson correlation of 0.956 with activation patching, whereas attribution patching reaches 0.006.

3. RelP provides faithfulness comparable to Integrated Gradients without the extra computational cost.


Tables

Table 4: Propagation rules applied in our LRP implementation across different model families. A tick (✓) indicates that the rule was applied, while a cross (×) indicates that it was not used.

Propagation Rule	GPT-2 Family	Pythia Family	Qwen2 Family	Gemma2-2B
LN-rule [26]	✓	✓	✓	✓
Identity-rule [55]	✓	✓	✓	✓
$0$-rule [52]	✓	✓	✓	✓
Half-rule [55, 57]	×	×	✓	✓
AH-rule [26]	×	×	×	×

Table 5: Pearson correlation coefficient (PCC) between activation patching and attribution patching (AtP) or relevance patching (RelP), computed over 100 IOI prompts for three GPT-2 model sizes (Small, Medium, Large), two Pythia models (70M, 410M), two Qwen2 models (0.5B, 7B), and Gemma2-2B. A higher PCC indicates closer alignment with activation patching results.

Model	Residual Stream: AtP	Residual Stream: RelP (ours)	Attention Outputs: AtP	Attention Outputs: RelP (ours)	MLP Outputs: AtP	MLP Outputs: RelP (ours)
GPT-2 small	0.753	0.968	0.979	0.962	0.195	0.992
GPT-2 medium	0.671	0.884	0.961	0.982	0.551	0.993
GPT-2 large	0.740	0.853	0.461	0.965	0.006	0.956
Pythia-70M	0.728	0.898	0.981	0.986	0.507	0.967
Pythia-410M	0.695	0.874	0.879	0.903	0.501	0.948
Qwen2-0.5B	0.684	0.922	0.933	0.965	0.717	0.981
Qwen2-7B	0.230	0.549	0.395	0.569	0.471	0.622
Gemma2-2B	0.391	0.769	0.942	0.956	0.910	0.952

Equations

$$c(n) := \mathbb{E}_{(x_{\text{original}},x_{\text{patch}})\sim D}\Big[\mathcal{L}\big(\mathcal{M}(x_{\text{original}}\mid\text{do}(n\leftarrow n(x_{\text{patch}})))\big)-\mathcal{L}\big(\mathcal{M}(x_{\text{original}})\big)\Big].$$

$$\hat{c}_{\text{AtP}}(n) := \mathbb{E}_{(x_{\text{original}},x_{\text{patch}})\sim D}\Big[\big(n(x_{\text{patch}})-n(x_{\text{original}})\big)^{\top}\frac{\partial\mathcal{L}(\mathcal{M}(x_{\text{original}}))}{\partial n}\Big|_{n=n(x_{\text{original}})}\Big].$$

$$f_{j}^{(l)}(a^{(l-1)})\approx\sum_{i}J_{ji}^{(l)}a_{i}^{(l-1)}+\tilde{b}_{j}^{(l)},\qquad J_{ji}^{(l)}=\frac{\partial f_{j}^{(l)}}{\partial a_{i}^{(l-1)}}\Big|_{\tilde{a}^{(l-1)}}.$$

$$\mathcal{R}_{i}^{(l-1)}=a_{i}^{(l-1)}\rho_{i}.$$

$$\hat{c}_{\text{RelP}}(n)=\mathbb{E}_{(x_{\text{original}},x_{\text{patch}})\sim D}\Big[\mathcal{R}_{\text{RelP}}(\mathcal{L}(\mathcal{M}(x_{\text{original}})))\Big|_{n(x_{\text{original}})}\Big],$$


Full text

Mechanistic Interpretability Workshop at NeurIPS 2025

RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching

Farnoush Rezaei Jafari$^{1,2}$, Oliver Eberle$^{1,2}$, Ashkan Khakzar, Neel Nanda

$^{1}$Machine Learning Group, Technische Universität Berlin

$^{2}$BIFOLD – Berlin Institute for the Foundations of Learning and Data

Correspondence to: [email protected]

Abstract

Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network’s output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG. Code is available at https://github.com/FarnoushRJ/RelP.git.

1 Introduction

Recent advances in machine learning continue to rely on transformer-based language models, which achieve remarkable performance across a wide range of tasks. Given their widespread adoption, understanding the internal mechanisms of these models is an important challenge for improving our ability to interpret, trust, and control them.

To address this challenge, the fields of eXplainable Artificial Intelligence (XAI) [1, 2] and, more recently, mechanistic interpretability [3, 4] have emerged, aiming to reveal the decision-making processes of such complex architectures. Early efforts in XAI focused on developing feature attribution methods, often visualized as heatmaps, to highlight relevant features in classification models. As state-of-the-art models have grown more complex, approaches that extend beyond input explanations have become increasingly important for uncovering their internal strategies. This has motivated a shift toward exploring the complex internal structure of neural networks through methods such as concept-based interpretability [5, 6], higher-order explanations [7, 8, 9], and circuit discovery [3, 10]. Herein, a central goal of interpretability research is the reliable localization of specific model components responsible for particular functions, algorithms or behaviors [3, 11, 4, 12, 13]. To identify components that are causally involved in inference, mechanistic interpretability has proposed activation patching, also known as causal mediation analysis [14, 15, 16]. Activation patching replaces the activations of selected components in an original input with those from a patch input, allowing direct causal testing of the contribution of a component to the model’s output.

As these patching-based interventions are computationally expensive, it is challenging to scale them, especially for larger models. To improve efficiency, attribution patching [17] was introduced as a fast gradient-based approximation to activation patching. Using gradients to estimate the influence of each component on the model’s output, attribution patching significantly reduces the computational cost compared to traditional perturbation-based approaches. However, this efficiency comes at the expense of robustness and accuracy, particularly in deep models like state-of-the-art LLMs [18]. This issue is not unique to attribution patching. Many gradient-based attribution methods [19, 20, 21, 22, 23] suffer from a lack of robustness in large networks, often due to noisy and unreliable gradients [24] that undermine the quality of explanations [25, 26, 27].

Alternative attribution methods have been proposed to overcome these limitations. One of the most widely used methods is Layer-wise Relevance Propagation (LRP) [28], developed within the theoretical framework of Deep Taylor Decomposition (DTD) [29]. LRP works by propagating the model’s output backward through the network, redistributing the relevance of each component to its inputs according to layer-specific propagation rules. These rules are designed to enforce desirable properties such as conservation of total relevance, sparsity of explanations, and robustness to noise.

Building on these insights, we propose Relevance Patching (RelP), a technique that replaces local gradients in attribution patching with propagation coefficients obtained using LRP. RelP enables more faithful localization of influential components in large models, without sacrificing scalability. Similar to attribution patching, RelP requires two forward passes and one backward pass. We demonstrate the effectiveness of RelP in the context of the Indirect Object Identification task, showing that it consistently outperforms attribution patching across a range of architectures and model sizes, including GPT-2 {Small, Medium, Large} [30], Pythia-{70M, 410M} [31], Qwen2-{0.5B, 7B} [32], and Gemma2-2B [33], with the strongest gains on the residual stream and MLP outputs. In addition, we apply RelP to recover sparse feature circuits responsible for Subject–Verb Agreement in the Pythia-70M model. Compared to Integrated Gradients [21], RelP demonstrates comparable faithfulness in identifying meaningful circuits while also offering greater computational efficiency.

2 Related Works

Activation and Attribution Patching Techniques

Activation patching is a causal mediation technique that tests the role of model components by replacing activations from an original run with those from a patch run [34, 35]. It has been widely used in interpretability studies [15, 36, 37, 38], with variants such as causal tracing, which perturbs activations with Gaussian noise [16], and path patching, which extends interventions to multi-step computation paths [39, 40]. To improve efficiency, attribution patching (AtP) uses gradient-based attribution instead of full interventions, requiring only two forward passes and one backward pass [17], and AtP* further enhances reliability while maintaining scalability [18]. Building on this line of work, we propose Relevance Patching (RelP), which also targets efficiency but replaces the local gradients in attribution patching with propagation coefficients computed using Layer-wise Relevance Propagation (LRP) [28], improving faithfulness to activation patching while maintaining the computational advantages of AtP.

Circuit Analysis

Circuit analysis methods aim to uncover subnetworks that drive particular model behaviors. Activation patching, used in ACDC [41], identifies critical edges in the computation graph but is computationally expensive. It has been scaled to large models [42] and adapted to study the effects of fine-tuning [43]. Sparse Autoencoders (SAEs) offer an alternative by learning disentangled, interpretable latent features [44, 45]. Methods such as IFR [46] and EAP [47] instead use gradient-based attributions to perform circuit discovery more efficiently. Finally, attribution-guided pruning with LRP has been shown to support both circuit discovery and model compression [48]. Our proposed Relevance Patching (RelP) also builds on LRP but stays within the patching framework, providing faithful and efficient attribution that approximates activation patching.

Gradient-Based and Propagation-Based Attribution Methods

Local gradients have long been used to explain the behavior of non-linear models [23]. Through backpropagation, they enable the efficient computation of saliency maps [28, 49], which have been extensively studied in the context of vision models. Saliency maps derived from raw gradients [49] capture the sensitivity of model predictions to small perturbations in input features. When these gradients are scaled by the corresponding input values, they yield Gradient $\times$ Input [22] explanations. Grad-CAM [20] aggregates class-specific gradients over the final convolutional feature maps, yielding coarse but discriminative heatmaps. Since raw gradients are often noisy, SmoothGrad [19] averages saliency maps over random perturbations, PathwayGrad [50] finds a sub-network (pathway) critical for the output and propagates gradients through that pathway, and Integrated Gradients [21] integrates along a baseline-to-input path to mitigate saturation and enforce key attribution axioms. Beyond gradient-based methods, propagation-based approaches such as LRP [28] and DeepLIFT [51] redistribute the model’s output backward through the network using local propagation rules, which can offer more faithful attribution. For a comprehensive review of explanation methods and attribution techniques, see [52, 1, 53, 2].

3 Background

From Activation to Attribution Patching

Let $\mathcal{M}:X\rightarrow\mathbb{R}^{V}$ be a decoder-only transformer model that maps an input sequence $x\in X:=\{1,...,V\}^{T}$ to a vector of output logits over a vocabulary of size $V$ . We represent the model as a directed computation graph $G=(N,E)$ , where each node $n\in N$ corresponds to a distinct model component, and each edge $(n_{1},n_{2})\in E$ represents a direct computational dependency. The activation of a component $n$ when processing input $x$ is denoted $n(x)$ .

Let $D$ be a distribution over prompt pairs $(x_{\text{original}},x_{\text{patch}})$ , where $x_{\text{original}}$ is a representative task sample that captures the behavior of interest, and $x_{\text{patch}}$ serves as a reference input used to introduce counterfactual perturbations for causal intervention. A metric $\mathcal{L}:\mathbb{R}^{V}\rightarrow\mathbb{R}$ quantifies the output behavior of the model.

The causal contribution $c(n)$ of a component $n\in N$ is defined as the expected effect of replacing its activation in the original input with the corresponding activation from the patch input, i.e.,

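$$c(n) := \mathbb{E}_{(x_{\text{original}},x_{\text{patch}})\sim D}\Big[\mathcal{L}\big(\mathcal{M}(x_{\text{original}}\mid\text{do}(n\leftarrow n(x_{\text{patch}})))\big)-\mathcal{L}\big(\mathcal{M}(x_{\text{original}})\big)\Big]. \tag{1}$$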

This definition follows the causal intervention framework, where $\text{do}(n\leftarrow n(x_{\text{patch}}))$ denotes overwriting the activation of $n$ with its value under the patch input. Evaluating $c(n)$ directly for all $n\in N$ is computationally prohibitive for large-scale models, motivating the development of efficient approximation methods such as attribution patching (AtP) [17]:

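$$\hat{c}_{\text{AtP}}(n) := \mathbb{E}_{(x_{\text{original}},x_{\text{patch}})\sim D}\Big[\big(n(x_{\text{patch}})-n(x_{\text{original}})\big)^{\top}\frac{\partial\mathcal{L}(\mathcal{M}(x_{\text{original}}))}{\partial n}\Big|_{n=n(x_{\text{original}})}\Big]. \tag{2}$$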

AtP (Eq. 2) approximates the effect of replacing component $n$ ’s activation from the original input with that from the patch input, without explicitly running the patched model, by leveraging gradients.
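To make the relationship between the exact patching effect and its gradient-based approximation concrete, the following minimal sketch compares the two on a toy one-component model. The downstream metric `tanh(v · n)`, the weights `v`, and all activation values are illustrative assumptions, not taken from the paper's experiments.

```python
import math

def metric_from_component(n, v):
    """Toy downstream computation (assumption): L = tanh(v . n)."""
    return math.tanh(sum(vi * ni for vi, ni in zip(v, n)))

def grad_metric(n, v):
    """Analytic gradient dL/dn of the toy metric, evaluated at n."""
    z = sum(vi * ni for vi, ni in zip(v, n))
    return [vi * (1.0 - math.tanh(z) ** 2) for vi in v]

def activation_patching_effect(n_orig, n_patch, v):
    """Exact effect: rerun the downstream computation with the patched activation."""
    return metric_from_component(n_patch, v) - metric_from_component(n_orig, v)

def attribution_patching_effect(n_orig, n_patch, v):
    """First-order estimate: activation difference dotted with the gradient at n_orig."""
    g = grad_metric(n_orig, v)
    return sum((p - o) * gi for o, p, gi in zip(n_orig, n_patch, g))

v = [0.5, -1.0]
n_orig = [0.2, 0.1]
n_patch_small = [0.25, 0.1]   # close to the original activation
n_patch_large = [3.0, -2.0]   # far from the original activation

ap_small = activation_patching_effect(n_orig, n_patch_small, v)
atp_small = attribution_patching_effect(n_orig, n_patch_small, v)
ap_large = activation_patching_effect(n_orig, n_patch_large, v)
atp_large = attribution_patching_effect(n_orig, n_patch_large, v)

print(f"small patch: AP={ap_small:.4f}  AtP={atp_small:.4f}")
print(f"large patch: AP={ap_large:.4f}  AtP={atp_large:.4f}")
```

In this toy setting the linear estimate tracks the exact effect when the patched activation is close to the original one and drifts away when the nonlinearity saturates, which is the kind of approximation error the surrounding text attributes to AtP.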

Layer-wise Relevance Propagation (LRP)

Layer-wise Relevance Propagation (LRP) [28] is a framework for interpreting neural network predictions. The central idea is to assign a relevance score $\mathcal{R}$ to each component or input feature, quantifying its contribution to the model’s output or to any metric defined from the output. These relevance scores are then propagated backward through the network according to layer-specific rules. The propagation follows the principle of relevance conservation [28, 29], ensuring that the total relevance at each layer is completely redistributed to the preceding layer, ultimately attributing the chosen metric to the input features.

To formalize this process, let $a_{i}^{(l-1)}$ with $i\in\{1,...,n\}$ denote the activations in layer $l-1$ . These activations serve as inputs to a layer represented by $f^{(l)}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ , resulting in outputs $a_{j}^{(l)}$ , $j\in\{1,...,m\}$ . A local first-order Taylor expansion around a reference point $\tilde{a}^{(l-1)}$ expresses each output as

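$$f_{j}^{(l)}(a^{(l-1)})\approx\sum_{i}J_{ji}^{(l)}a_{i}^{(l-1)}+\tilde{b}_{j}^{(l)},\qquad J_{ji}^{(l)}=\frac{\partial f_{j}^{(l)}}{\partial a_{i}^{(l-1)}}\Big|_{\tilde{a}^{(l-1)}}. \tag{3}$$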

Here, $J^{(l)}_{ji}$ is the local Jacobian, describing how activation $a_{i}^{(l-1)}$ influences output $a_{j}^{(l)}$, while $\tilde{b}^{(l)}_{j}$ collects constant terms and approximation errors.

This decomposition motivates the LRP redistribution step. Each component $j$ in layer $l$ is assigned a relevance score $\mathcal{R}^{(l)}_{j}$ , which is redistributed to its inputs in proportion to their contributions to the activation of component $j$ . Here, this redistribution from upper-layer activations to component $i$ is expressed through a propagation coefficient $\rho_{i}$ , which aggregates the contributions from all connected upper-layer components $j$ . The relevance of component $i$ in layer $l-1$ is then given by

$$\mathcal{R}_{i}^{(l-1)}=a_{i}^{(l-1)}\rho_{i}. \tag{4}$$

The propagation coefficient $\rho_{i}$ depends on the local Jacobian $J^{(l)}$ , the activations $a^{(l-1)}$ , and the relevance scores at layer $l$ (i.e., $\mathcal{R}^{(l)}$ ). Intuitively, $\rho_{i}$ acts as a filter that determines which parts of the input activations are considered relevant.
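As an illustration of how a propagation coefficient can be formed, the sketch below applies the basic $0$-rule to a single bias-free linear layer: each input receives relevance in proportion to its contribution $W_{ji}a_i$ to each output. The weights, activations, and upper-layer relevance values are made-up numbers, and the function name is hypothetical.

```python
def lrp0_linear(a_in, W, R_out):
    """0-rule for a bias-free linear layer z_j = sum_i W[j][i] * a_in[i].

    Returns (rho, R_in) with R_in[i] = a_in[i] * rho[i].
    Assumes every z_j is non-zero (no stabiliser term is used here).
    """
    n_in, n_out = len(a_in), len(W)
    z = [sum(W[j][i] * a_in[i] for i in range(n_in)) for j in range(n_out)]
    # rho_i aggregates the contributions from all connected upper-layer units j.
    rho = [sum(W[j][i] * R_out[j] / z[j] for j in range(n_out)) for i in range(n_in)]
    R_in = [a * r for a, r in zip(a_in, rho)]
    return rho, R_in

a_in = [1.0, 2.0, -1.0]
W = [[0.5, 0.5, 1.0],
     [1.0, -0.5, 0.5]]
R_out = [1.0, 2.0]

rho, R_in = lrp0_linear(a_in, W, R_out)
print("rho  =", rho)
print("R_in =", R_in)
print("sum(R_in) =", sum(R_in), " sum(R_out) =", sum(R_out))
```

The two sums agree, which is the conservation property described above: the relevance arriving at the layer is fully redistributed to its inputs.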

The layer-wise redistribution of relevance can be repeated until the input layer is reached or stopped at an intermediate layer of interest. The procedure begins with an initial relevance signal $\mathcal{R}^{(L)}_{c}$ , typically defined using a metric computed from the model outputs, commonly the logit of the true, predicted, or any target class $c$ at the output layer [28, 29]. Contrastive approaches have also been proposed, which explain the difference between class logits to highlight features that strongly influence changes in the model’s prediction [26, 54].

4 RelP: Relevance Patching for Mechanistic Analysis

Gradient-based attribution methods have been extensively studied in the XAI literature [23, 22, 19, 21, 20]. Despite their simplicity and model-agnostic applicability, these methods often face challenges when applied to large neural networks due to noisy and less reliable gradients [25, 27]. LRP mitigates these shortcomings by propagating the network’s output $\mathcal{M}(x)$ , or any metric $\mathcal{L}$ defined from the output, backward through the network in a structured and theoretically grounded manner. Depending on the layers in a given model architecture, different propagation rules have been proposed. These rules are typically derived from gradient analyses of model components [29, 26, 55], with specific choices guided by common desiderata for explanations, such as sparsity, reduced noise, and relevance conservation [56, 52].

Specialized propagation rules have been developed to handle diverse nonlinear components (e.g., activation functions, attention mechanisms, normalization layers) and architectural nuances across deep network families [57, 55, 27, 52, 7, 58, 8]. An overview of relevant rules for transformer models is shown in Table 1.

As attribution patching relies on gradients to approximate the effect of substituting hidden activations (“patching in”), it is also susceptible to the noise and approximation errors inherent to gradient-based attribution methods, particularly in very large models [24, 29]. To address this, we introduce Relevance Patching (RelP).

Formally, RelP follows the structure of standard attribution patching (AtP). In AtP, the contribution of a component is computed as the dot product between the change in its activation, caused by replacing the original input with a patch input, and the gradient of the metric $\mathcal{L}$ with respect to that component, evaluated at the original input. RelP modifies this procedure by substituting the gradient term with the LRP-derived propagation coefficient (introduced in Section 3) for that component. The resulting contribution score is defined as

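$$\hat{c}_{\text{RelP}}(n) := \mathbb{E}_{(x_{\text{original}},x_{\text{patch}})\sim D}\Big[\big(n(x_{\text{patch}})-n(x_{\text{original}})\big)^{\top}\rho(\mathcal{L}(\mathcal{M}(x_{\text{original}})))\Big|_{n(x_{\text{original}})}\Big] = \mathbb{E}_{(x_{\text{original}},x_{\text{patch}})\sim D}\Big[\mathcal{R}_{\text{RelP}}(\mathcal{L}(\mathcal{M}(x_{\text{original}})))\Big|_{n(x_{\text{original}})}\Big], \tag{5}$$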

where $\rho(\mathcal{L}(\mathcal{M}(x_{\text{original}})))$ and $\mathcal{R}_{\text{RelP}}(\mathcal{L}(\mathcal{M}(x_{\text{original}})))$ denote, respectively, the propagation coefficient and the relevance score for component $n$ .
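The structure of the RelP score can be sketched as follows: the same activation difference used in AtP is contracted with a per-component coefficient, and the result is averaged over prompt pairs. The function names and all numeric values are hypothetical; in practice the coefficients would come from one LRP backward pass on the original input rather than being written by hand.

```python
def patching_score(n_orig, n_patch, coeff):
    """Dot product of the activation difference with a per-dimension coefficient.

    With coeff = gradient this is the AtP estimate; with coeff = LRP
    propagation coefficient it is the RelP estimate for one prompt pair.
    """
    return sum((p - o) * c for o, p, c in zip(n_orig, n_patch, coeff))

def expected_score(pairs):
    """Average the per-pair scores, standing in for the expectation over D."""
    scores = [patching_score(o, p, c) for o, p, c in pairs]
    return sum(scores) / len(scores)

# Each entry: (activation on original prompt, activation on patch prompt,
#              propagation coefficient at the original prompt). Toy values.
pairs = [
    ([1.0, 0.0], [0.0, 1.0], [0.5, -0.5]),
    ([2.0, 1.0], [1.0, 1.0], [1.0, 0.25]),
]

relp_estimate = expected_score(pairs)
print("RelP estimate for this component:", relp_estimate)
```

Because only the coefficient changes relative to AtP, the cost profile stays the same: one forward pass per prompt in the pair and a single backward pass.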

RelP preserves the efficiency of attribution patching while improving faithfulness. Unlike the approach of Marks et al. [44], which relies on Integrated Gradients [21] for more accurate approximations at higher computational cost, RelP provides scalable and faithful localization of influential components.

Propagation Rules

Table 1 summarizes the propagation rules applicable to core components of transformer architectures. As discussed in Section 3, relevance conservation is a key property in LRP, ensuring that the sum of relevance scores is preserved across layers. This prevents relevance from being lost or artificially introduced during propagation. Many of the specialized rules in Table 1 are designed to address situations where this property could fail, and are often derived from gradient analyses of individual model components.

In attention heads, attention outputs depend on inputs both directly and via the attention weights $A_{ij}$, which are themselves input-dependent. Correlations between these terms can break conservation; the AH‑rule preserves it by treating $A_{ij}$ as constants, effectively linearizing the heads [26]. LayerNorm’s centering and variance-based scaling can cause “relevance collapse”, which the LN‑rule mitigates by treating $(\sqrt{\epsilon+\mathrm{Var}[x]})^{-1}$ as constant, allowing the operation to be seen as linear and preserving conservation [26]. For multiplicative gates, the Half‑rule splits relevance equally between the two factors to avoid spurious doubling [55, 57]. Activation functions (e.g., GELU, SiLU) can disrupt conservation because their output is a nonlinear transformation of the input; the Identity‑rule addresses this by treating the non‑linear component as constant [55]. For linear layers, the $0$‑rule is equivalent to Gradient $\times$ Input [22]. The $\epsilon$‑rule and $\gamma$‑rule extend the $0$‑rule: the $\epsilon$‑rule is applied when sparsity and noise reduction are desired, while the $\gamma$‑rule is used when it is preferable to emphasize positive contributions over negative ones [52]. For additional rules specifically tailored to transformers, we refer the reader to [26, 27, 59].
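The effect of the LN‑rule can be illustrated on a bare normalization step without learned scale or shift. In the sketch below, the true gradient through the normalization yields Gradient $\times$ Input scores that sum to almost zero, whereas holding the scaling factor constant yields scores that sum to the downstream metric. The inputs `x`, the upstream signal `g`, and the function names are assumptions made for illustration; this is one reading of the rule, not the paper's implementation.

```python
import math

def _stats(x, eps):
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    return mu, math.sqrt(var + eps)

def layernorm(x, eps=1e-5):
    """Normalization without affine parameters: y_i = (x_i - mean) / sqrt(var + eps)."""
    mu, sigma = _stats(x, eps)
    return [(xi - mu) / sigma for xi in x]

def true_gradient(x, g, eps=1e-5):
    """Exact dL/dx for L = sum_j g_j * y_j, differentiating through mean and variance."""
    n = len(x)
    _, sigma = _stats(x, eps)
    y = layernorm(x, eps)
    g_mean = sum(g) / n
    gy_mean = sum(gj * yj for gj, yj in zip(g, y)) / n
    return [(g[i] - g_mean - y[i] * gy_mean) / sigma for i in range(n)]

def ln_rule_coefficient(x, g, eps=1e-5):
    """LN-rule sketch: treat 1/sqrt(var + eps) as a constant, keep the centering."""
    n = len(x)
    _, sigma = _stats(x, eps)
    g_mean = sum(g) / n
    return [(gi - g_mean) / sigma for gi in g]

x = [1.0, 2.0, 4.0, -1.0]      # toy input to the normalization
g = [1.0, 0.0, -1.0, 2.0]      # toy upstream signal dL/dy

metric = sum(gj * yj for gj, yj in zip(g, layernorm(x)))
gxi_true = sum(xi * ci for xi, ci in zip(x, true_gradient(x, g)))
gxi_ln = sum(xi * ci for xi, ci in zip(x, ln_rule_coefficient(x, g)))

print(f"metric L                    = {metric:.6f}")
print(f"sum of x * true gradient    = {gxi_true:.6f}")
print(f"sum of x * LN-rule coeff.   = {gxi_ln:.6f}")
```

The near-zero sum under the true gradient follows from the normalization being almost invariant to rescaling its input, so a first-order attribution assigns it almost no total relevance.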

5 Experiments

We evaluate the effectiveness of RelP by benchmarking it against attribution patching (AtP), on the Indirect Object Identification (IOI) task, across three GPT‑2 variants (Small, Medium, and Large), two Pythia models (70M and 410M), two Qwen2 models (0.5B and 7B), and Gemma2-2B. Furthermore, we assess RelP’s ability to recover sparse feature circuits underlying Subject–Verb Agreement in the Pythia-70M model, comparing its performance against Integrated Gradients (IG) as a more accurate gradient-based baseline. Additional details of our experimental setup are provided in Appendix A.

5.1 Evaluating RelP on the IOI Task

The objective of this experiment is to evaluate how well attribution patching (AtP) and relevance patching (RelP) approximate the ground-truth effects captured by activation patching (AP) on the IOI task. This task involves determining which entity in a sentence functions as the indirect object, typically the recipient or beneficiary of the direct object.

To enable controlled evaluations, we construct 100 prompt pairs. Each pair consists of an original prompt (with the correct indirect object) and a patch variant in which the subject and indirect object are swapped, keeping all other tokens the same. All prompts are generated from structured templates [10] (Table 2) to ensure consistency across examples. We define the metric $\mathcal{L}$ as the difference in logits between the original and patch targets. The methods RelP, AtP, and AP are each applied to the residual stream, attention outputs, and MLP outputs. Since the resulting attribution scores are computed across hidden dimensions, we aggregate them by summing over these dimensions. To quantify how well AtP and RelP approximate AP, we compute the Pearson correlation coefficient (PCC) between their results and those of AP.
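The evaluation pipeline described above (logit-difference metric, summation over hidden dimensions, Pearson correlation against activation patching) can be sketched in a few lines. The helper names and the toy score arrays are hypothetical and stand in for real per-component attribution scores.

```python
import math

def logit_difference(logits, orig_target, patch_target):
    """Metric L: logit of the original target minus logit of the patch target."""
    return logits[orig_target] - logits[patch_target]

def aggregate_hidden(scores):
    """scores[c] holds per-hidden-dimension scores of component c; sum them."""
    return [sum(per_dim) for per_dim in scores]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy scores for three components with two hidden dimensions each.
ap_scores = aggregate_hidden([[0.2, 0.1], [0.0, -0.4], [1.0, 0.5]])
approx_scores = aggregate_hidden([[0.1, 0.1], [0.1, -0.3], [0.8, 0.4]])

pcc = pearson(ap_scores, approx_scores)
print("aggregated AP scores:     ", ap_scores)
print("aggregated approximation: ", approx_scores)
print(f"PCC = {pcc:.4f}")
```

A PCC close to 1 means the approximation ranks and scales components much like activation patching does, which is the sense in which the reported correlations measure faithfulness.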

The results of this experiment are presented in Figure 1. As shown, RelP outperforms AtP in nearly all settings across a range of architectures and model sizes, including GPT-2 {Small, Medium, Large}, Pythia-{70M, 410M}, Qwen2-{0.5B, 7B}, and Gemma2-2B; the one exception is the attention outputs of GPT-2 Small, where AtP reaches 0.979 and RelP 0.962. The performance gap is especially pronounced in the MLP output and residual stream analyses. Detailed numerical results are listed in Table 5 in Appendix B.

Qualitative differences between AtP and RelP are also visible in Figure 2. Nanda [17] suggests that attribution patching fails notably in the residual stream, primarily due to its large activations and the nonlinearity introduced by LayerNorm, which significantly disrupts the underlying linear approximation. It also performs poorly on MLP0, since this layer in GPT-2 Small functions as an “extended embedding”, and Nanda [17] has shown that linear approximations in this layer may become unstable, leading to misleading insights. As can be seen in Figure 2, the RelP results align more closely with those from activation patching, especially when considering the residual stream and MLP outputs (e.g., MLP0). Further qualitative results are provided in Appendix C.

Overall, we find that the proposed method, RelP, consistently achieves strong correlations with activation patching and is often clearly superior to standard attribution patching. This underscores the effectiveness of the tailored propagation rules in the LRP framework, which provide a reliable and efficient approximation of the costly activation patching procedure across models and components commonly studied in mechanistic interpretability research.

5.2 Sparse Feature Circuits for Subject-Verb Agreement

Sparse feature circuits are small, interconnected groups of interpretable features that jointly drive specific behaviors in language models. Analyzing these circuits allows us to understand how models combine meaningful components to solve tasks, offering insight into the underlying mechanisms behind their decisions. In this experiment, we aim to uncover feature circuits involved in the subject-verb agreement task, a linguistic evaluation that tests a model’s ability to correctly match a verb’s inflection (e.g., singular or plural) to the grammatical number of its subject.

To identify and analyze these circuits, we use features discovered using sparse autoencoders (SAEs), which provide fine-grained, human-interpretable units. We follow [44] and use their SAEs trained on the Pythia-70M language model, together with their sparse features, in our experiments. These sparse features serve as the fundamental units for circuit analysis. To assess their influence on the model’s output, Marks et al. [44] employed efficient linear approximations of activation patching, such as Integrated Gradients (IG), to estimate the contributions of individual features and their interactions. Although IG provides greater faithfulness than attribution patching, it requires multiple forward and backward passes, resulting in higher computational cost. Our experiments show that RelP achieves faithfulness on par with IG, without incurring computational overhead beyond that of attribution patching.

We use three templatic datasets (Table 3), where tokens at the same positions serve similar roles, allowing us to average node and edge effects across examples while retaining positional information. Consistent with [44], we assess the identified sparse feature circuits using faithfulness and completeness evaluation metrics [10]. Additionally, we compare our sparse feature circuits with neuron circuits, constructed by applying the same discovery method directly to individual neurons rather than to SAE features, serving as a baseline for evaluation.

Faithfulness measures how much of the model’s original performance is captured by the discovered circuit, relative to a baseline where the entire model is mean-ablated. It is calculated as $\frac{\mathcal{L}(C)-\mathcal{L}(\emptyset)}{\mathcal{L}(\mathcal{M})-\mathcal{L}(\emptyset)}$, where $\mathcal{L}(C)$ is the metric when only the circuit $C$ is active, $\mathcal{L}(\emptyset)$ is the metric of the fully mean-ablated model, and $\mathcal{L}(\mathcal{M})$ is the metric of the full model.

Completeness evaluates how much of the model’s behavior is not explained by the discovered circuit. It is defined as the faithfulness of the circuit’s complement, $(\mathcal{M}\backslash C)$ . In other words, if the complement of the circuit still explains a lot of the behavior, then the original circuit is not complete in capturing the mechanism.
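Both evaluation metrics reduce to the same normalized ratio, applied either to the circuit or to its complement. A minimal sketch, with hypothetical metric values in place of real model measurements:

```python
def faithfulness(metric_circuit, metric_ablated, metric_full):
    """(L(C) - L(empty)) / (L(M) - L(empty)).

    metric_circuit: metric with only the circuit C active
    metric_ablated: metric of the fully mean-ablated model
    metric_full:    metric of the full model
    """
    return (metric_circuit - metric_ablated) / (metric_full - metric_ablated)

def completeness(metric_complement, metric_ablated, metric_full):
    """Faithfulness of the circuit's complement M \\ C."""
    return faithfulness(metric_complement, metric_ablated, metric_full)

# Hypothetical metric values, chosen only to illustrate the arithmetic.
L_full, L_ablated = 5.0, 1.0
L_circuit, L_complement = 4.0, 1.5

f = faithfulness(L_circuit, L_ablated, L_full)
c = completeness(L_complement, L_ablated, L_full)
print(f"faithfulness = {f:.3f}, completeness = {c:.3f}")
```

Under this definition a good circuit has faithfulness near 1 and a complement whose faithfulness is near 0, meaning little of the behavior survives once the circuit is removed.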

We report faithfulness scores for feature and neuron circuits as the node threshold $T_{N}$ is varied (Figure 3). The threshold keeps only nodes with contribution scores above $T_{N}$ , retaining the most relevant ones. Consistent with [44], small feature circuits explain most of the model’s behavior: in Pythia-70M, about 100 features account for the majority of performance, whereas roughly 1,500 neurons are needed to explain half.
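The thresholding step can be sketched as a simple filter over per-node contribution scores. The node names and scores below are invented, and the comparison on raw scores is a literal reading of the description above; an implementation might instead compare score magnitudes.

```python
def select_circuit_nodes(scores, node_threshold):
    """Keep nodes whose contribution score exceeds the node threshold T_N."""
    return {name: s for name, s in scores.items() if s > node_threshold}

# Hypothetical contribution scores for four nodes.
scores = {"feat_a": 0.9, "feat_b": 0.05, "feat_c": 0.4, "feat_d": -0.2}

for t_n in (0.01, 0.1, 0.5):
    kept = select_circuit_nodes(scores, t_n)
    print(f"T_N = {t_n}: {len(kept)} nodes kept -> {sorted(kept)}")
```

Sweeping the threshold in this way trades circuit size against faithfulness, which is what the faithfulness curves over $T_{N}$ report.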

SAE error nodes summarize reconstruction errors in the SAE, making them fundamentally different from single neurons. To analyze their impact, we evaluate faithfulness after removing all SAE error nodes, as well as those originating from attention and MLP modules. Consistent with findings by Marks et al. [44], we observe that removing residual stream SAE error nodes severely disrupts model performance, whereas removing MLP and attention error nodes has a less pronounced effect.

Notably, the circuits identified by RelP achieve faithfulness scores comparable to those obtained with IG at lower computational cost. In our experiments, IG uses 10 integration steps, each adding to the number of forward and backward passes, whereas RelP matches the efficiency of AtP, requiring only two forward passes and one backward pass. For neuron circuits and for feature circuits without attention/MLP error nodes, RelP even improves on the faithfulness of IG. This combination of faithfulness and efficiency makes RelP a more effective and practical approach for uncovering sparse feature circuits in large language models.
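To illustrate where the extra cost of IG comes from, the sketch below estimates a patching effect on a toy metric by averaging gradients at several points between the original and patched activation. The metric `tanh(v · n)`, the midpoint sampling scheme, and all values are assumptions for illustration; only the step count of 10 is taken from the text.

```python
import math

def metric(n, v):
    """Toy downstream metric (assumption): L = tanh(v . n)."""
    return math.tanh(sum(vi * ni for vi, ni in zip(v, n)))

def grad_metric(n, v):
    """Analytic gradient dL/dn of the toy metric."""
    z = sum(vi * ni for vi, ni in zip(v, n))
    return [vi * (1.0 - math.tanh(z) ** 2) for vi in v]

def ig_patching_effect(n_orig, n_patch, v, steps=10):
    """Integrated-gradients estimate; needs one gradient evaluation per step."""
    delta = [p - o for o, p in zip(n_orig, n_patch)]
    avg_grad = [0.0] * len(n_orig)
    for k in range(steps):
        alpha = (k + 0.5) / steps            # midpoint of the k-th sub-interval
        point = [o + alpha * d for o, d in zip(n_orig, delta)]
        g = grad_metric(point, v)
        avg_grad = [a + gi / steps for a, gi in zip(avg_grad, g)]
    return sum(d * a for d, a in zip(delta, avg_grad))

v = [0.5, -1.0]
n_orig, n_patch = [0.2, 0.1], [3.0, -2.0]

exact = metric(n_patch, v) - metric(n_orig, v)
single_gradient = sum((p - o) * g
                      for o, p, g in zip(n_orig, n_patch, grad_metric(n_orig, v)))
ig_10 = ig_patching_effect(n_orig, n_patch, v, steps=10)

print(f"exact effect          = {exact:.4f}")
print(f"one gradient (AtP)    = {single_gradient:.4f}")
print(f"IG with 10 gradients  = {ig_10:.4f}")
```

In this toy case the averaged gradients recover the exact effect closely, but at ten gradient evaluations instead of one; RelP aims for comparable faithfulness while keeping the single backward pass.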

Following [44], we also evaluate completeness (Figure 3) and observe that ablating only a few nodes from feature circuits can drastically reduce model performance, whereas in Pythia-70M hundreds of neurons must be ablated for a similar effect. This highlights the efficiency of the identified sparse feature circuits in capturing critical model behavior, as well as the usefulness of RelP in locating them.

6 Conclusion

In this paper, we introduced Relevance Patching (RelP) as an enhancement over the standard attribution patching method. RelP preserves the overall algorithmic structure of attribution patching while replacing its local gradient signal with the propagation coefficient derived from Layer-wise Relevance Propagation (LRP). This modification enables RelP to more accurately approximate the causal effects captured by activation patching, while remaining computationally efficient.
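To make the contrast concrete, the following toy sketch compares the three quantities on a two-layer GELU network in pure Python. It is an illustration under stated assumptions rather than our implementation: the network, weights, and function names are hypothetical, both estimates are evaluated on the original run, linear maps are propagated with their ordinary (bias-free) weights, and the Identity-rule is realized by replacing the derivative of GELU with the ratio $\mathrm{GELU}(z)/z$.

```python
import math


def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_grad(z):
    """Ordinary derivative of GELU, as used by attribution patching (AtP)."""
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf


def gelu_identity_coeff(z):
    """Identity-rule coefficient GELU(z)/z, which equals the Gaussian CDF of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def forward(a, W1, w2, act):
    """Toy model m(a) = sum_j w2[j] * act((W1 a)[j]); returns m and pre-activations."""
    z = [sum(w * x for w, x in zip(row, a)) for row in W1]
    return sum(v * act(zj) for v, zj in zip(w2, z)), z


def backward_coeffs(z, W1, w2, act_coeff):
    """Backward pass to the patched activation, using act_coeff at the nonlinearity."""
    return [
        sum(w2[j] * act_coeff(z[j]) * W1[j][i] for j in range(len(W1)))
        for i in range(len(W1[0]))
    ]


def patching_estimates(a_orig, a_patch, W1, w2,
                       act=gelu, act_grad=gelu_grad, act_lrp=gelu_identity_coeff):
    m_orig, z = forward(a_orig, W1, w2, act)
    m_patch, _ = forward(a_patch, W1, w2, act)
    delta = [p - o for p, o in zip(a_patch, a_orig)]
    grad = backward_coeffs(z, W1, w2, act_grad)   # local gradient (AtP)
    coeff = backward_coeffs(z, W1, w2, act_lrp)   # propagation coefficient (RelP)
    return {
        "activation_patching": m_patch - m_orig,  # exact effect of the patch
        "atp": sum(d * g for d, g in zip(delta, grad)),
        "relp": sum(d * c for d, c in zip(delta, coeff)),
    }


W1 = [[1.0, -2.0], [0.5, 0.3]]
w2 = [1.5, -0.7]
estimates = patching_estimates([0.2, -0.1], [1.0, 0.4], W1, w2)
```

In this sketch the only difference between the two estimates is the coefficient used at the nonlinearity. For a purely linear network both coefficients reduce to the weights, and both estimates coincide with the exact activation patching effect.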

In our experiments on the Indirect Object Identification task, we demonstrated that RelP exhibits stronger alignment with activation patching across residual stream, attention, and MLP outputs compared to attribution patching. Furthermore, in comparing sparse feature circuits identified by RelP and Integrated Gradients (IG), we showed that RelP achieves comparable faithfulness without incurring the additional computational cost associated with IG. Our study thus demonstrates that methods developed for feature attribution can be effectively integrated into mechanistic interpretability, helping to advance our understanding of modern foundation models.

7 Limitations and Future Work

Applying RelP to localize relevant model components requires choosing appropriate rules within the LRP framework, which introduces some model-specific overhead compared to model-agnostic attribution methods. In our experiments, RelP consistently outperformed attribution patching on residual stream and MLP outputs, while gains for attention outputs were smaller. This suggests that a deeper analysis of gradient structure in self-attention, combined with more careful rule selection, could improve attribution faithfulness. Prior work has examined these challenges in the context of input-level feature attribution [26, 27, 55], but systematic studies of internal layers remain limited.

For our experiments, we used small- and medium-scale open-access models, which allow full access to internals and gradients. The tasks were chosen to match standard circuit analysis benchmarks, relying on predefined original–patch input pairs and single-token prediction metrics. As a result, circuit discovery was limited to moderately complex behaviors. Extending RelP to more challenging settings, such as free-form text generation or scenarios without known counterfactual inputs, remains an important direction for future work.

Acknowledgments and Disclosure of Funding

We acknowledge support by the Federal Ministry of Research, Technology and Space (BMFTR) for BIFOLD (ref. 01IS18037A). F.R.J. was partly supported by the MATS 8.0 program, during which foundational experiments for this work were completed.

Appendix A Experimental Details

A.1 LRP Implementation

We applied the LN-rule [26] for LayerNorm/RMSNorm, the Identity-rule [55] for nonlinear activation functions, and the LRP-[math] rule [52] for linear transformations. The Half-rule [55, 57] was applied only to the Qwen2 model family and Gemma2-2B, as other models used in our experiments did not include multiplicative gating operations. No attention-specific rule (e.g., AH-rule) was used. Full details are given in Table 4.

A.2 Further Details on Subject-Verb Agreement Experiment

In this experiment, we used 300 samples, structured according to Table 3, for circuit discovery, and 100 held-out samples for evaluating faithfulness and completeness. Consistent with [44], the first one-third of each circuit was excluded from evaluation, since components in early model layers are typically responsible for processing specific tokens, which may not consistently appear across training and test splits.

Appendix B Quantitative Results

The exact numerical values corresponding to the IOI experiment, visualized in Figure 1, are reported in Table 5.

Appendix C Further Qualitative Results

In Section 5, we presented the qualitative differences between AtP and RelP for the GPT-2 Small model. In this section, we extend our analysis with additional experimental results for other models. These results again show that RelP approximates activation patching more accurately than AtP, particularly for the residual stream and MLP outputs.

Bibliography


[1] Wojciech Samek, Grégoire Montavon, Sebastian Lapuschkin, Christopher J. Anders, and Klaus-Robert Müller. Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3):247–278, 2021. doi: 10.1109/JPROC.2021.3060483.
[2] Luca Longo, Mario Brcic, Federico Cabitza, Jaesik Choi, Roberto Confalonieri, Javier Del Ser, Riccardo Guidotti, Yoichi Hayashi, Francisco Herrera, Andreas Holzinger, et al. Explainable artificial intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Information Fusion, 106:102301, 2024.
[3] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
[4] Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025.
[5] Pattarawat Chormai, Jan Herrmann, Klaus-Robert Müller, and Grégoire Montavon. Disentangled explanations of neural network predictions by finding relevant subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(11):7283–7299, 2024.
[6] Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina MC Höhne, and Oliver Eberle. Capturing polysemanticity with PRISM: A multi-concept feature description framework. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[7] Oliver Eberle, Jochen Büttner, Florian Kräutli, Klaus-Robert Müller, Matteo Valleriani, and Grégoire Montavon. Building and interpreting deep similarity models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1149–1161, 2020.
[8] Thomas Schnake, Oliver Eberle, Jonas Lederer, Shinichi Nakajima, Kristof T. Schütt, Klaus-Robert Müller, and Grégoire Montavon. Higher-order explanations of graph neural networks via relevant walks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7581–7596, 2022.