The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Siquan Li; Kaiqi Jiang; Jiacheng Sun; Tianyang Hu

arXiv:2605.06611·cs.LG·May 8, 2026

TL;DR

This paper traces the attention sink phenomenon in Large Language Models to a variance discrepancy that is induced by value aggregation in self-attention and amplified by super neurons in FFN layers, and it proposes a normalization method, head-wise RMSNorm, to mitigate it.

Contribution

It provides a mechanistic explanation for attention sinks, identifies their root causes, and introduces head-wise RMSNorm to improve training stability and convergence.

Findings

01. Attention sink is caused by variance discrepancy in value aggregation.

02. Super neurons amplify variance discrepancies, leading to attention sinks.

03. Head-wise RMSNorm stabilizes value aggregation and accelerates training.
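
To make the third finding concrete, here is a minimal sketch of what a head-wise RMSNorm could look like. This page does not give the paper's exact formulation, so everything here is an assumption for illustration: the function names `rms_norm` and `head_wise_rms_norm` are hypothetical, the per-head learnable gain is omitted (treated as 1), and the input is assumed to be one token's concatenated per-head attention output.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale vec so that its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def head_wise_rms_norm(x, num_heads, eps=1e-6):
    """Normalize each head's slice of x independently.

    x is a flat list of length num_heads * head_dim holding one token's
    concatenated per-head outputs (an assumed layout, not the paper's).
    """
    if len(x) % num_heads != 0:
        raise ValueError("len(x) must be divisible by num_heads")
    head_dim = len(x) // num_heads
    out = []
    for h in range(num_heads):
        head_slice = x[h * head_dim:(h + 1) * head_dim]
        out.extend(rms_norm(head_slice, eps=eps))
    return out


# Two heads with very different scales end up on the same scale.
token_output = [1.0, -1.0, 1.0, -1.0, 100.0, -100.0, 100.0, -100.0]
normalized = head_wise_rms_norm(token_output, num_heads=2)
print(normalized)
```

Normalizing per head rather than across the full hidden dimension means a head with unusually large or small output variance is rescaled on its own, without changing the scale of the other heads.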

Abstract

Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii)…
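
The variance discrepancy from value aggregation can be illustrated with a toy simulation. The sketch below is not the paper's experiment. It assumes uniform causal attention weights and i.i.d. unit-variance value vectors, both simplifications chosen for illustration, and the function names are hypothetical. Under a causal mask the first token can only attend to itself, so its output keeps the full variance of its value vector, while position t averages t + 1 value vectors and its variance shrinks accordingly.

```python
import random


def causal_uniform_aggregate(values):
    """Return, for each position t, the mean of values[0..t].

    This is causal attention with uniform weights (an assumption).
    """
    outputs = []
    running = [0.0] * len(values[0])
    for t, v in enumerate(values):
        running = [r + x for r, x in zip(running, v)]
        outputs.append([r / (t + 1) for r in running])
    return outputs


def per_position_variance(seq_len=16, dim=64, trials=200, seed=0):
    """Estimate the per-channel second moment of the output at each position."""
    rng = random.Random(seed)
    sums = [0.0] * seq_len
    for _ in range(trials):
        # i.i.d. standard normal value vectors (an assumption).
        values = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(seq_len)]
        for t, out in enumerate(causal_uniform_aggregate(values)):
            sums[t] += sum(x * x for x in out) / dim
    return [s / trials for s in sums]


variances = per_position_variance()
# Under these assumptions the theoretical value at position t is 1 / (t + 1).
for t in (0, 1, 3, 15):
    print(t, round(variances[t], 4), "theory:", round(1.0 / (t + 1), 4))
```

In this toy setting the first position's output variance stays near 1 while later positions decay toward 0, which gives a simple picture of why the first token's representation can differ systematically in scale from the rest.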
