The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu

TL;DR
This paper uncovers the structural causes of the attention sink phenomenon in Large Language Models, linking it to variance discrepancies caused by value aggregation and super neurons, and proposes a normalization method to mitigate it.
Contribution
It provides a mechanistic explanation for attention sinks, identifies their root causes, and introduces head-wise RMSNorm to improve training stability and convergence.
Findings
Attention sinks trace back to a systematic variance discrepancy induced by value aggregation in self-attention.
Super neurons in FFN layers drastically amplify this discrepancy, which leads to attention sinks.
Head-wise RMSNorm stabilizes value aggregation and accelerates training.
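To make the last finding concrete, here is a minimal sketch of RMS normalization applied independently per attention head. The summary does not say where in the block the normalization is placed or whether it carries a learnable gain, so the function names (`rms_norm`, `headwise_rms_norm`), the optional `gains` argument, and the `eps` value are illustrative assumptions rather than the authors' implementation.

```python
import math

def rms(vec):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))

def rms_norm(vec, eps=1e-6):
    """Rescale vec so that its root-mean-square is approximately 1."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in vec]

def headwise_rms_norm(heads, gains=None, eps=1e-6):
    """Normalize each head's vector separately (hypothetical helper).

    heads: list of per-head vectors for a single token.
    gains: optional list of per-head gain vectors (assumed, not from the paper).
    """
    out = []
    for h, vec in enumerate(heads):
        normed = rms_norm(vec, eps)
        if gains is not None:
            normed = [g * x for g, x in zip(gains[h], normed)]
        out.append(normed)
    return out

# Two heads whose outputs live on very different scales.
heads = [[0.1, -0.2, 0.05, 0.15], [30.0, -10.0, 20.0, 5.0]]
normed = headwise_rms_norm(heads)
print([round(rms(h), 3) for h in normed])
```

Normalizing per head, rather than over the concatenated hidden vector, means a single large-scale head cannot dominate the statistics used to rescale the others.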
Abstract
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii)…
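One way to build intuition for the variance discrepancy described above is a toy simulation. It is not the paper's experiment and rests on strong simplifying assumptions: scalar value entries drawn i.i.d. from a unit Gaussian, and uniform causal attention, so position t outputs the mean of the first t values. Under these assumptions the first token only aggregates itself and keeps variance near 1, while later positions average more values and their variance shrinks roughly as 1/t. All names here are illustrative.

```python
import random
import statistics

def causal_uniform_outputs(values):
    """Output at position t is the mean of values[0..t] (uniform causal attention)."""
    outputs, running = [], 0.0
    for t, v in enumerate(values):
        running += v
        outputs.append(running / (t + 1))
    return outputs

def positional_variance(seq_len=16, trials=4000, seed=0):
    """Estimate the variance of the aggregated value at each position."""
    rng = random.Random(seed)
    per_pos = [[] for _ in range(seq_len)]
    for _ in range(trials):
        # Assumption: i.i.d. unit-variance value entries.
        values = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]
        for t, out in enumerate(causal_uniform_outputs(values)):
            per_pos[t].append(out)
    return [statistics.pvariance(samples) for samples in per_pos]

variances = positional_variance()
for pos in (0, 1, 3, 15):
    print(f"position {pos + 1:2d}: variance ~ {variances[pos]:.3f}")
```

Real attention weights are not uniform and value vectors are not i.i.d., so this only illustrates the direction of the effect, not its magnitude in a trained model.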
