GIM: Improved Interpretability for Large Language Models

Joakim Edin; Róbert Csordás; Tuukka Ruotsalo; Zhengxuan Wu; Maria Maistro; Casper L. Christensen; Jing Huang; Lars Maaløe

arXiv:2505.17630·cs.CL·October 2, 2025

3 Reviews

TL;DR

This paper introduces Gradient Interaction Modifications (GIM), a novel method that enhances interpretability of large language models by addressing self-repair phenomena within attention mechanisms, leading to more faithful attribution of model components.

Contribution

GIM is a new technique that accounts for self-repair during backpropagation, improving faithfulness of interpretability methods for large language models.

Findings

1. GIM outperforms existing attribution methods across multiple LLMs.

2. GIM reveals more accurate importance of attention scores.

3. Enhanced interpretability aids in understanding and improving LLMs.

Abstract

Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models…
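
The softmax redistribution described in the abstract can be illustrated with a small toy example. The sketch below is not GIM and is not taken from the paper. It only shows the phenomenon under assumed, hypothetical numbers: two keys carry the same value and both receive a large attention score, so removing either one alone lets softmax shift its weight to the other. The helper names (`attention_output`, `ablate`, `score_gradient`) and the choice to ablate a score by setting it to a large negative number are assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(scores, values):
    # Scalar attention output: softmax-weighted sum of scalar values.
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

def ablate(scores, indices, fill=-1e9):
    # Assumed ablation: push the chosen scores to a large negative number,
    # so their softmax weight becomes effectively zero.
    return [fill if i in indices else s for i, s in enumerate(scores)]

def score_gradient(scores, values, i):
    # Analytic gradient of the output w.r.t. score i:
    # d out / d s_i = w_i * (v_i - out)
    weights = softmax(scores)
    out = sum(w * v for w, v in zip(weights, values))
    return weights[i] * (values[i] - out)

# Hypothetical toy setup: keys 0 and 1 are redundant (same value, same
# high score); key 2 is a low-scoring distractor with value 0.
scores = [5.0, 5.0, 0.0]
values = [1.0, 1.0, 0.0]

baseline = attention_output(scores, values)
effect_0 = baseline - attention_output(ablate(scores, {0}), values)
effect_1 = baseline - attention_output(ablate(scores, {1}), values)
joint_effect = baseline - attention_output(ablate(scores, {0, 1}), values)
grad_0 = score_gradient(scores, values, 0)

print("individual effects:", effect_0, effect_1)
print("sum of individual effects:", effect_0 + effect_1)
print("joint effect:", joint_effect)
print("gradient w.r.t. score 0:", grad_0)
```

In this setup the joint ablation effect is far larger than the sum of the two individual effects, and the gradient with respect to either important score is close to zero. Both single-component ablation and a plain gradient would therefore rate these scores as unimportant, which is the underestimation the abstract attributes to self-repair in attention.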

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The approach addresses redundancy-driven cancellation inside the attention mechanism, and integration seems lightweight, e.g., no model retraining and only small code edits. The paper spans multiple model families and datasets with cumulative analyses. The improvements in comprehensiveness and sufficiency suggest better alignment with model behaviour.

Weaknesses

Most evaluations use short-context benchmarks; it is unclear how well the approach holds up in longer contexts. Section 5.2 introduces a 0.1 threshold to decide when to treat effects as joint, which helps but feels coarse, as it is a global heuristic with no principled grounding. Over- and under-aggregation might occur. The approach should be tested to determine whether attributions stay stable under light, meaning-preserving edits (paraphrases, synonym swaps, punctuation/tokenisation noise) and

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The mathematical explanation of the self-repair mechanism for attention weights is clear and convincing. 2. The proposed TSG solution is well motivated theoretically. 3. The breadth of models and datasets is quite comprehensive. 4. GIM as a whole often outperforms prior attribution methods in comprehensiveness and sufficiency across a variety of models and datasets.

Weaknesses

1. The depth of analysis for the attention self-repair effect is somewhat weak. While the same analysis is present across multiple datasets and models, it seems limited to comparing the joint and sum of individual ablation effects. Are there any other analyses that the authors investigated to better understand this novel phenomenon, such as the average group size of similarly contributing values with substantial attention weights, or how many groups of such values there tend to be? 2. Looking at

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper identifies self-repair due to attention as a clear failure-mode of conventional gradient attributions and illustrates it with a toy example (Figure 1). 2. It shows that self-repair occurs in a variety of LLMs and tasks (Figure 2a and appendices). 3. The paper combines several existing tools (layernorm freeze, gradient normalization) and the novel TSG into a single method (GIM). 4. The metrics (sufficiency and comprehensiveness) seem well-chosen for this problem, directly benchmar

Weaknesses

1. The paper describes TSG as an empirically-developed method that provides better attributions for OR-gate-like components. This seems well-supported by the mathematical and empirical arguments. However, I wonder whether this method is addressing a symptom or fixing the underlying issue of the attribution: Does TSG still work on non-OR-gate-like settings? Does it make attributions worse in some settings? This is somewhat discussed in the discussion section but left for future work; I would expe
