An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
Jett Janiak, Can Rager, James Dao, Yeu-Tong Lau

TL;DR
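This paper investigates how memory management in small transformers affects interpretability. It shows that certain attention heads erase information that earlier heads wrote to the residual stream, and that direct logit attribution (DLA) can give misleading results when it does not account for this erasure.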
Contribution
The paper provides concrete evidence of the erasure phenomenon in a 4-layer transformer and shows how direct logit attribution can mislead when erasure is not taken into account.
Findings
Identification of heads that consistently erase residual information
Demonstration that DLA can produce misleading interpretations due to erasure
Evidence of memory management mechanisms in small transformers
Abstract
Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can show misleading results by not accounting for erasure.
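The failure mode described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's code or model: the vectors, token names, and helper functions are all hypothetical, and it ignores LayerNorm and attention entirely. It assumes only that DLA means projecting one component's output onto the unembedding directions in isolation.

```python
# Toy illustration only: hand-picked numbers, no LayerNorm, no attention.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def logits(resid, W_U):
    """Project a residual-stream vector onto each token's unembedding direction."""
    return {tok: dot(resid, direction) for tok, direction in W_U.items()}

# Hypothetical unembedding directions for two tokens (d_model = 3).
W_U = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}

# Outputs written to the residual stream by each (hypothetical) component.
writer_out = [2.0, 0.0, 1.0]     # early head: has a component along "cat"
eraser_out = [-2.0, 0.0, -1.0]   # later head: removes exactly what the writer wrote
other_out = [0.0, 0.5, 0.0]      # everything else in the model

# DLA: apply the unembedding to each component's output in isolation.
dla_writer = logits(writer_out, W_U)
dla_eraser = logits(eraser_out, W_U)

# The actual logits come from the sum of all component outputs.
final_resid = add(add(writer_out, eraser_out), other_out)
final_logits = logits(final_resid, W_U)

print("DLA writer:", dla_writer)
print("DLA eraser:", dla_eraser)
print("final logits:", final_logits)
```

In this toy setup, DLA credits the writer with +2.0 on "cat" and the eraser with -2.0, even though the two cancel and the final "cat" logit is 0.0. Read in isolation, either attribution suggests an effect on the prediction that the pair together does not have.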
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
Methods: Direct Logit Attribution (DLA)
