An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Jett Janiak; Can Rager; James Dao; Yeu-Tong Lau

arXiv:2310.07325·cs.LG·December 17, 2024

TL;DR

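This paper investigates how memory management in small transformers affects interpretability. It shows that certain attention heads erase information that earlier heads wrote to the residual stream, and that direct logit attribution (DLA) can give misleading results when it does not account for this erasure.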

Contribution

It provides concrete evidence of erasure phenomena in a 4-layer transformer and highlights limitations of direct logit attribution in this context.

Findings

01 Identification of heads that consistently erase residual information

02 Demonstration that DLA can produce misleading interpretations due to erasure

03 Evidence of memory management mechanisms in small transformers

Abstract

Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can show misleading results by not accounting for erasure.
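
The failure mode described in the abstract can be illustrated with a toy sketch. This is not the paper's code or model: the dimensions, the random unembedding matrix `W_U`, and the three hypothetical component outputs (`writer_out`, `eraser_out`, `other_out`) are assumptions chosen for illustration, and the final LayerNorm is omitted. The sketch assumes DLA means projecting a single component's residual-stream write directly onto the logits, and it models perfect erasure as a later head writing the exact negative of an earlier head's output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 8, 5  # toy sizes, chosen arbitrarily

# Hypothetical unembedding matrix (final LayerNorm omitted for simplicity).
W_U = rng.normal(size=(d_model, d_vocab))


def direct_logit_attribution(component_output, W_U):
    """Project one component's residual-stream write straight onto the logits."""
    return component_output @ W_U


# Toy per-component writes to the residual stream at a single position.
writer_out = rng.normal(size=d_model)  # early head writes some direction
eraser_out = -writer_out               # later head removes exactly that write
other_out = rng.normal(size=d_model)   # every other component, lumped together

# The residual stream is the sum of all writes; logits are linear in it.
final_resid = writer_out + eraser_out + other_out
logits = final_resid @ W_U

dla_writer = direct_logit_attribution(writer_out, W_U)
dla_eraser = direct_logit_attribution(eraser_out, W_U)
dla_other = direct_logit_attribution(other_out, W_U)

# DLA assigns the writer (and the eraser) a nonzero logit contribution,
# even though together they leave the logits unchanged.
net_effect = dla_writer + dla_eraser
```

Read in isolation, `dla_writer` suggests the early head matters for the prediction, and `dla_eraser` suggests the later head opposes it. In this toy setup the two cancel, so the logits depend only on `other_out`. Real erasure in a trained model would be partial and pass through LayerNorm, so the cancellation would be approximate at best.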

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning

Methods: Direct Logit Attribution (DLA)