When Attention Sink Emerges in Language Models: An Empirical View
Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin

TL;DR
This paper investigates the universal emergence of attention sink in language models, revealing its causes, effects, and how alternative attention mechanisms can mitigate this phenomenon.
Contribution
It provides the first comprehensive empirical analysis of attention sink in LMs, exploring its origins, influencing factors, and potential solutions through alternative attention methods.
Findings
Attention sink exists universally across various LMs and inputs.
Attention sink emerges during pre-training after effective optimization.
Replacing softmax with sigmoid attention prevents sink emergence in models up to 1B parameters.
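The difference between the two attention variants named in the findings can be sketched numerically. The block below is a minimal illustration, not the paper's implementation: the function names, the random queries and keys, and the `first_token_mass` measure are all assumptions made for this example. It only shows the mechanics: softmax weights in each row must sum to one, while sigmoid weights are computed independently per key and carry no such constraint. Random inputs will not reproduce a trained model's sink.

```python
import numpy as np

def causal_logits(q, k):
    """Scaled dot-product logits plus a boolean mask of future positions."""
    t, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
    return logits, future

def softmax_attention(logits, future):
    """Row-normalized weights: every query row sums to exactly one."""
    masked = np.where(future, -np.inf, logits)
    shifted = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(shifted)  # exp(-inf) is 0, so future positions get no weight
    return w / w.sum(axis=-1, keepdims=True)

def sigmoid_attention(logits, future):
    """Element-wise weights: each key is scored on its own, rows need not sum to one."""
    w = 1.0 / (1.0 + np.exp(-logits))
    return np.where(future, 0.0, w)

def first_token_mass(weights):
    """Hypothetical measure: mean weight placed on position 0 by queries 1..T-1.

    Query 0 is skipped because under causal softmax it can only attend to itself.
    """
    return float(weights[1:, 0].mean())

# Toy inputs (assumed shapes: 8 tokens, head dimension 16).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 16))

logits, future = causal_logits(q, k)
soft = softmax_attention(logits, future)
sig = sigmoid_attention(logits, future)

print("softmax row sums:", soft.sum(axis=-1).round(3))
print("sigmoid row sums:", sig.sum(axis=-1).round(3))
print("first-token mass (softmax):", first_token_mass(soft))
print("first-token mass (sigmoid):", first_token_mass(sig))
```

To inspect a real model, the same `first_token_mass` measure could be applied to per-head attention matrices extracted from it; how those matrices are obtained depends on the framework and is not covered here.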
Abstract
Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly…
Peer Reviews
Decision: ICLR 2025 Spotlight
- The paper provides a thorough investigation into the attention sink phenomenon, covering various factors that influence its emergence in LMs.
- The findings have practical applications in areas such as streaming/long context generation, KV cache optimization, and model quantization.
- The study considers various model architectures, optimizers, and data distributions, offering a broad perspective on the attention sink phenomenon.
- The study notes that attention sink disappears with less training data, suggesting that the models might be overfitting to certain aspects of the training data. These two factors seem impossible to decouple, so it cannot be determined whether overfitting or the amount of data affects attention sink.
- The study claims to focus on language models. However, this work pays more attention to auto-regressive LMs and may not capture the nuances of other types of encoder-
1. The perspectives for studying the attention sink problem are diverse and well-motivated. These perspectives are also inspiring for future studies on attention mechanisms.
2. The experiments are very comprehensive and diverse.
The primary weakness of this paper is the lack of in-depth analysis. The empirical results come across more as observations rather than as thorough investigations. Including more theoretical analysis or deeper experimental work would strengthen the paper. For instance, in the KV bias section, it would be beneficial to explore how attention variants with and without attention sink are related in formulation.
- Reproducibility.
- The paper is well written and the experimental setup is solid.
- The paper not only observes the presence of attention sinks but also delves into the mechanistic reasons behind their emergence (also providing insights on the training dynamics that foster attention sinks).
- The paper shows that attention sinks persist across various input types, like random token sequences and repeated tokens. I think that this paper is a valuable step further in understanding the phenomenon o
- The paper doesn't really take into account the impact of attention sink on downstream tasks.
- While the paper studies the emergence and immediate characteristics of attention sinks, it does not assess the long-term impacts of attention sinks on model behavior (like stability during fine-tuning, adaptability to new tasks, or resistance to adversarial attacks).
Not really a weakness, but I found the paper a bit difficult to read because of the unusual aesthetic choices: I think that this
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Softmax · Attention Sinks
