Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers

Donald Ye

arXiv:2602.01442·cs.LG·May 12, 2026

TL;DR

Gradient-based attribution systematically misranks component importance in transformers: it overvalues redundant early-layer components and undervalues late-layer ones, because first-order gradients cannot detect collective redundancy. This calls its reliability as an interpretability tool into question.

Contribution

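The paper causally evaluates gradient attribution at the component level across two algorithmic tasks (sequence reversal and sequence sorting) and up to 10 random seeds, exposing a systematic layer-wise failure and tracing it to redundancy that first-order gradients cannot detect.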

Findings

01

Gradient attribution overvalues early-layer Gradient Bloats.

02

Late-layer Hidden Heroes are undervalued by gradient attribution.

03

Bloats are collectively redundant: ablating them jointly causes 14× greater damage than the individual ablations predict.
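The third finding can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the paper's model or code: `noisy_or` is a hypothetical readout that stays high while at least one of two redundant units is active, and gradient-times-activation (computed by finite differences) stands in for a generic first-order attribution score.

```python
def noisy_or(acts):
    """Hypothetical readout: high while at least one redundant unit is active."""
    prod = 1.0
    for a in acts:
        prod *= (1.0 - a)
    return 1.0 - prod


def ablate(acts, idxs):
    """Zero-ablate the units whose indices are in idxs."""
    return [0.0 if i in idxs else a for i, a in enumerate(acts)]


def damage(acts, idxs):
    """Drop in output caused by ablating the given units."""
    return noisy_or(acts) - noisy_or(ablate(acts, idxs))


def grad_times_act(acts, i, eps=1e-6):
    """First-order score: central-difference gradient times activation."""
    up = list(acts)
    dn = list(acts)
    up[i] += eps
    dn[i] -= eps
    grad = (noisy_or(up) - noisy_or(dn)) / (2.0 * eps)
    return grad * acts[i]


acts = [0.9, 0.9]  # two redundant units, both strongly active (made-up values)
individual = [damage(acts, {i}) for i in range(len(acts))]
joint = damage(acts, {0, 1})
first_order = [grad_times_act(acts, i) for i in range(len(acts))]
ratio = joint / sum(individual)

print("individual damage:", individual)
print("joint damage:", joint)
print("first-order scores:", first_order)
print("joint / sum of individual:", ratio)
```

In this toy setting the first-order score of each unit matches its individual ablation damage (about 0.09), while joint ablation removes the whole output (about 0.99), roughly 5.5 times the sum of the individual effects. The ratio is a property of the made-up numbers, not a reproduction of the 14× figure reported in the paper.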

Abstract

Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks and up to 10 random seeds, uncovering a systematic, layer-wise failure: gradient attribution consistently overvalues early-layer Gradient Bloats and undervalues late-layer Hidden Heroes. Rank correlation collapses from $\rho = 0.72$ on sequence reversal to $\rho = 0.27$ on sequence sorting, reaching $\rho = -0.18$ in individual seeds. This failure stems from first-order gradient attribution's inability to detect collective redundancy: joint Bloat ablation causes $14\times$ greater damage than individual results predict. Consequently, Bloats dominate gradient rankings despite negligible functional impact, while ablating Hidden Heroes…
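The rank correlation in the abstract compares how components are ordered by attribution score with how they are ordered by causal (ablation) damage. A minimal sketch of that comparison follows, assuming Spearman's rank correlation is the statistic intended; the abstract only says "rank correlation". The score lists are invented for illustration and are not data from the paper, and `ranks` and `spearman` are hypothetical helper names.

```python
def ranks(values):
    """Ascending ranks starting at 1; ties receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five components, ordered from early to late layers.
attribution = [0.9, 0.7, 0.5, 0.2, 0.1]        # gradient-based scores
ablation_damage = [0.1, 0.2, 0.05, 0.8, 0.9]   # measured loss increase

rho = spearman(attribution, ablation_damage)
print("Spearman rho:", rho)
```

With these invented numbers the early components score highest on attribution but lowest on ablation damage, so the correlation is strongly negative (about -0.7). A value near 1 would indicate that attribution tracks causal importance.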
