TL;DR
This paper introduces Gradient Interaction Modifications (GIM), a novel method that enhances interpretability of large language models by addressing self-repair phenomena within attention mechanisms, leading to more faithful attribution of model components.
Contribution
GIM is a new technique that accounts for self-repair during backpropagation, improving faithfulness of interpretability methods for large language models.
Findings
GIM outperforms existing attribution methods across multiple LLMs.
GIM estimates the importance of attention scores more accurately.
Enhanced interpretability aids in understanding and improving LLMs.
Abstract
Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models…
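The softmax redistribution described in the abstract can be illustrated with a small numerical sketch. This is a hypothetical toy setup, not the paper's experiment or the GIM method itself: the scores, the values, and the helper names `softmax` and `attention_output` are assumptions chosen for illustration. Two keys carry the same value and both receive large attention scores; ablating either one alone lets softmax shift its weight to the other, so the output barely changes.

```python
import math

def softmax(scores):
    # Numerically stable softmax; a score of -inf receives zero weight.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(scores, values, ablate=()):
    # Ablate a position by masking its score to -inf before the softmax.
    masked = [float("-inf") if i in ablate else s for i, s in enumerate(scores)]
    weights = softmax(masked)
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical toy inputs: positions 0 and 1 are redundant (same value, same score).
scores = [4.0, 4.0, 0.0]
values = [1.0, 1.0, 0.0]

baseline = attention_output(scores, values)
effect_0 = baseline - attention_output(scores, values, ablate={0})
effect_1 = baseline - attention_output(scores, values, ablate={1})
joint_effect = baseline - attention_output(scores, values, ablate={0, 1})

# Gradient of the output w.r.t. score 0 for a softmax-weighted sum: w_0 * (v_0 - output).
weights = softmax(scores)
grad_0 = weights[0] * (values[0] - baseline)

print(f"sum of individual effects: {effect_0 + effect_1:.4f}")
print(f"joint effect:              {joint_effect:.4f}")
print(f"gradient w.r.t. score 0:   {grad_0:.4f}")
```

In this toy case the joint ablation effect is far larger than the sum of the two individual effects, and the gradient with respect to either score is close to zero. That gap is the kind of underestimation the abstract attributes to self-repair within attention.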
Peer Reviews
Decision·Submitted to ICLR 2026
The approach addresses redundancy-driven cancellation inside the attention mechanism, and integration seems lightweight, e.g., no model retraining and only small code edits. The paper spans multiple model families and datasets with cumulative analyses. The improvements in comprehensiveness/sufficiency suggest better alignment with model behaviour.
Most evaluations are short-context benchmarks; it’s unclear how well the approach holds up in longer contexts. Section 5.2 introduces a 0.1 threshold to decide when to treat effects as joint, which helps but feels coarse, as it is a global heuristic with no principled grounding. Over- and under-aggregation might occur. The approach should be tested to determine whether attributions stay stable under light, meaning-preserving edits (paraphrases, synonym swaps, punctuation/tokenisation noise) and
1. The mathematical explanation of the self-repair mechanism for attention weights is clear and convincing. 2. The proposed TSG solution is well motivated theoretically. 3. The breadth of models and datasets is quite comprehensive. 4. GIM as a whole often outperforms prior attribution methods in comprehensiveness and sufficiency across a variety of models and datasets.
1. The depth of analysis for the attention self-repair effect is somewhat weak. While the same analysis is present across multiple datasets and models, it seems limited to comparing the joint and sum of individual ablation effects. Are there any other analyses that the authors investigated to better understand this novel phenomenon, such as the average group size of similarly contributing values with substantial attention weights, or how many groups of such values there tend to be? 2. Looking at
1. The paper identifies self-repair due to attention as a clear failure-mode of conventional gradient attributions and illustrates it with a toy example (Figure 1). 2. It shows that self-repair occurs in a variety of LLMs and tasks (Figure 2a and appendices). 3. The paper combines several existing tools (layernorm freeze, gradient normalization) and the novel TSG into a single method (GIM). 4. The metrics (sufficiency and comprehensiveness) seem well-chosen for this problem, directly benchmar
1. The paper describes TSG as an empirically-developed method that provides better attributions for OR-gate-like components. This seems well-supported by the mathematical and empirical arguments. However, I wonder whether this method is addressing a symptom or fixing the underlying issue of the attribution: Does TSG still work on non-OR-gate-like settings? Does it make attributions worse in some settings? This is somewhat discussed in the discussion section but left for future work; I would expe
