Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Tianyu Guo; Druv Pai; Yu Bai; Jiantao Jiao; Michael I. Jordan; Song Mei

arXiv:2410.13835 · cs.LG · November 8, 2024

TL;DR

This paper investigates extreme-token phenomena in large language models: it identifies their mechanisms in simple models, extends those insights to pretrained LLMs, and proposes mitigation strategies that apply during pretraining.

Contribution

The paper introduces an active-dormant mechanism that explains extreme-token phenomena, demonstrates its presence in both toy models and real LLMs, and offers mitigation strategies.

Findings

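01. Extreme-token phenomena arise in very simple transformers (one to three layers) trained on the Bigram-Backcopy (BB) task.

02. An active-dormant mechanism explains attention sink behavior.

03. Replacing softmax with ReLU and Adam with SGD during pretraining can reduce extreme-token phenomena.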
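
One intuition for why swapping softmax for ReLU could help: softmax attention weights always sum to 1, so a head with nothing useful to attend to must still place its attention somewhere, whereas ReLU weights can all be zero. The sketch below illustrates only that property. The function names, the scores, and the division by sequence length in `relu_weights` are assumptions made for illustration, not necessarily the formulation used in the paper.

```python
import math

def softmax_weights(scores):
    """Standard softmax over one query's attention scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLU attention sketch: relu(score) / n.

    Scaling by sequence length n is an assumption for illustration.
    """
    n = len(scores)
    return [max(s, 0.0) / n for s in scores]

# Hypothetical scores for a head that finds no key relevant (all negative).
scores = [-2.0, -3.0, -2.5, -4.0]

soft = softmax_weights(scores)
relu = relu_weights(scores)

print("softmax total weight:", sum(soft))  # forced to be (numerically) 1
print("ReLU total weight:", sum(relu))     # can be exactly zero
```

With softmax, the least-negative score still receives the largest share of a fixed budget of 1; with ReLU, the head can simply output nothing.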

Abstract

Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become…
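
The three phenomena named in the abstract can each be read off per-token statistics. The following is a minimal sketch, assuming access to one head's causal attention matrix plus per-token value and residual vectors. The helper `extreme_token_stats` and all numbers are invented for illustration and do not come from the paper or its repository.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def extreme_token_stats(attn, values, residuals):
    """Per-token diagnostics for one attention head.

    attn[i][j] is the weight query i places on key j (causal: j <= i).
    values[j] and residuals[j] are the value and residual vectors of token j.
    """
    n = len(attn)
    stats = []
    for j in range(n):
        # Average attention token j receives from the queries that can see it.
        received = [attn[i][j] for i in range(j, n)]
        stats.append({
            "token": j,
            "mean_attention_received": sum(received) / len(received),
            "value_norm": l2_norm(values[j]),
            "residual_norm": l2_norm(residuals[j]),
        })
    return stats

# Invented toy numbers in which token 0 behaves like a sink token.
attn = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
]
values = [[0.01, 0.02], [1.0, 0.5], [0.7, 0.9]]
residuals = [[50.0, 40.0], [1.0, 2.0], [2.0, 1.0]]

stats = extreme_token_stats(attn, values, residuals)
for row in stats:
    print(row)
```

In this toy input, token 0 shows all three signatures at once: the highest attention received, the smallest value norm, and the largest residual norm.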

Code & Models

Repositories

guotianyu2000/active-dormant-attention
PyTorch · Official

Taxonomy

Topics: Advancements in Photolithography Techniques

Methods: Attention Is All You Need · LLaMA · Softmax · Stochastic Gradient Descent · ALIGN · Adam