Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers
Donald Ye

TL;DR
This paper shows that gradient-based attribution methods systematically misrepresent layer importance in transformers: they overvalue early layers and undervalue late layers because first-order gradients cannot detect redundancy among components. This calls their reliability for interpretability into question.
Contribution
It provides the first causal evaluation of gradient attribution in transformers, uncovering layer-wise failures and showing how redundancy degrades attribution accuracy.
Findings
Gradient attribution overvalues early-layer Gradient Bloats.
Late-layer Hidden Heroes are undervalued by gradient attribution.
Redundancy makes joint Bloat ablation more damaging than the individual ablations predict.
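The redundancy finding can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the paper: the component names `bloat_a` and `bloat_b`, the saturating readout in `toy_output`, and the choice to measure damage as the drop in output. Two components carry the same signal, so removing either one alone costs nothing, while removing both removes the signal entirely.

```python
# Toy illustration only: names and the readout function are assumptions,
# not the paper's model or ablation procedure.
COMPONENTS = frozenset({"bloat_a", "bloat_b"})


def toy_output(active):
    """Hypothetical readout that saturates once any redundant unit is active."""
    signal = sum(1.0 for name in COMPONENTS if name in active)
    return min(1.0, signal)


def ablation_damage(ablated):
    """Drop in output when the given components are removed."""
    baseline = toy_output(COMPONENTS)
    return baseline - toy_output(COMPONENTS - set(ablated))


# Damage from removing each component on its own.
individual = {name: ablation_damage([name]) for name in sorted(COMPONENTS)}

# Damage from removing both at once.
joint = ablation_damage(COMPONENTS)

# Gap between the joint damage and what the individual results add up to.
redundancy_gap = joint - sum(individual.values())
```

In this sketch each individual ablation causes zero damage while the joint ablation removes the whole output, so `redundancy_gap` is positive. Single-component evaluations of any kind would miss this collective effect.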
Abstract
Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks and up to 10 random seeds, uncovering a systematic, layer-wise failure: gradient attribution consistently overvalues early-layer Gradient Bloats and undervalues late-layer Hidden Heroes. Rank correlation collapses when moving from sequence reversal to sequence sorting, with the most extreme values in individual seeds. This failure stems from first-order gradient attribution's inability to detect collective redundancy: joint Bloat ablation causes greater damage than individual results predict. Consequently, Bloats dominate gradient rankings despite negligible functional impact, while ablating Hidden Heroes…
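The rank-correlation comparison described in the abstract can be sketched as follows. This is a minimal illustration, assuming Spearman correlation is the rank statistic (the abstract does not name one). The scores in `gradient_score` and `ablation_damage` are invented toy values, not results from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented toy scores for four components, ordered early layer to late layer.
gradient_score = [0.9, 0.8, 0.3, 0.1]       # attribution favors early components
ablation_damage = [0.05, 0.10, 0.60, 0.90]  # ablation favors late components

rho = spearman(gradient_score, ablation_damage)
```

With these toy values the two rankings are exact opposites, so `rho` sits at the negative extreme. A value near 1 would instead mean that gradient attribution orders components the same way ablation does.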
