Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei

TL;DR
This paper investigates extreme-token phenomena in large language models: it explains their mechanisms using simple models, extends the insights to pretrained LLMs, and proposes mitigation strategies for pretraining.
Contribution
The paper introduces an active-dormant mechanism that explains extreme-token phenomena, demonstrates its presence in both toy models and real LLMs, and offers mitigation strategies.
Findings
Extreme-token phenomena arise even in very simple transformers (one to three layers) trained on the Bigram-Backcopy (BB) task.
An active-dormant mechanism explains attention sink behavior.
Mitigation strategies such as replacing softmax with ReLU and Adam with SGD can reduce these phenomena.
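One intuition for the ReLU finding, sketched below in plain Python: softmax weights always sum to 1, so a head with nothing useful to attend to must still place its attention mass somewhere, whereas ReLU weights can all be zero. The scores are hypothetical, and the 1/n scaling in `relu_weights` is an assumption made for illustration, not necessarily the formulation used in the paper.

```python
import math

def softmax_weights(scores):
    """Softmax attention: weights are positive and always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLU attention with an assumed 1/n scaling: weights may all be zero."""
    n = len(scores)
    return [max(s, 0.0) / n for s in scores]

# Hypothetical scores for a head with nothing useful to attend to:
# every token scores low, the first token slightly less low than the rest.
scores = [-1.0, -6.0, -6.0, -6.0]

soft = softmax_weights(scores)
relu = relu_weights(scores)

print("softmax:", [round(w, 4) for w in soft])
print("relu:   ", relu)
```

With softmax, almost all of the mass lands on the first token even though its raw score is negative; with ReLU, the head simply outputs nothing for this input.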
Abstract
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become…
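The three phenomena named in the abstract can be checked with simple per-token statistics. The following is a minimal sketch, assuming you already have one head's causal attention matrix plus per-token value and residual vectors; the function name `extreme_token_report` and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def extreme_token_report(attn, values, residuals):
    """Per-token statistics for one attention head.

    attn:      n x n causal attention weights (row q attends to keys k <= q)
    values:    n value-state vectors
    residuals: n residual-state vectors
    """
    n = len(attn)
    # Mean attention each key receives, averaged over the queries that can see it.
    received = [sum(attn[q][k] for q in range(k, n)) / (n - k) for k in range(n)]
    value_norms = [l2_norm(v) for v in values]
    residual_norms = [l2_norm(r) for r in residuals]
    # Candidate sink token: the key that receives the most attention.
    sink = max(range(n), key=lambda k: received[k])
    return {
        "sink": sink,
        "received": received,
        "value_norms": value_norms,
        "residual_norms": residual_norms,
    }

# Hypothetical toy data in which token 0 behaves like a sink token.
attn = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.10, 0.10, 0.00],
    [0.85, 0.05, 0.05, 0.05],
]
values = [[0.01, 0.02], [1.0, -0.5], [0.7, 0.9], [-0.8, 0.6]]
residuals = [[50.0, -40.0], [1.0, 2.0], [-1.5, 0.5], [0.8, -1.2]]

report = extreme_token_report(attn, values, residuals)
print(report)
```

In this toy input the same token receives the most attention (attention sink), has the smallest value-state norm (value-state drain), and has the largest residual-state norm (residual-state peak), which is the pattern the abstract describes.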
Taxonomy
Topics: Advancements in Photolithography Techniques
Methods: Attention Is All You Need · LLaMA · Softmax · Stochastic Gradient Descent · ALIGN · Adam
