Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Zizhuo Fu; Wenxuan Zeng; Runsheng Wang; Meng Li

arXiv:2602.01203·cs.CL·May 5, 2026

TL;DR

This paper reveals that the attention sink phenomenon in large language models naturally forms a Mixture-of-Experts structure within attention layers, and proposes sink-aware training to mitigate head collapse and improve performance.

Contribution

The paper provides a theoretical and empirical analysis linking the attention sink to MoE structures and introduces a sink-aware training method to address head collapse in attention layers.

Findings

01

Attention sink naturally constructs a Mixture-of-Experts within attention layers.

02

Sink-aware training improves head load balancing and model performance.

03

The method is effective across various attention mechanisms.

Abstract

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method…
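
One way to read the MoE claim in the abstract is that the probability mass a head places on the sink scales that head's contribution, so one minus the sink probability behaves like a per-head gate. The NumPy sketch below illustrates this under stated assumptions: the sink is modeled as one extra per-head logit in the softmax denominator with no value vector, and the balance penalty is a generic MoE-style term. The names `sink_attention` and `head_load_balance_loss` are hypothetical, and the paper's actual loss is not given on this page.

```python
import numpy as np


def sink_attention(q, k, v, sink_logit):
    """Single-query attention for H heads with one extra sink logit per head.

    q: (H, d), k: (H, T, d), v: (H, T, d), sink_logit: (H,)
    Returns the head outputs (H, d) and a per-head gate (H,), where
    gate = 1 - probability assigned to the sink.
    Assumption: the sink has no value vector, so it only absorbs mass.
    """
    d = q.shape[-1]
    scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(d)             # (H, T)
    logits = np.concatenate([sink_logit[:, None], scores], axis=1)  # (H, T+1)
    logits = logits - logits.max(axis=1, keepdims=True)             # stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    sink_prob = probs[:, 0]
    # Only the real tokens contribute values; the sink column is dropped.
    out = np.einsum("ht,htd->hd", probs[:, 1:], v)
    return out, 1.0 - sink_prob


def head_load_balance_loss(gates):
    """Hypothetical balance penalty over heads (not the paper's loss).

    gates: (N, H) gate values collected over N tokens.
    Returns H * sum_h share_h^2, which is 1 when all heads carry equal
    load and H when a single head carries all of it.
    """
    load = gates.mean(axis=0)
    share = load / (load.sum() + 1e-9)
    return gates.shape[1] * np.sum(share ** 2)
```

Under these assumptions, each head's output equals its gate times the output of ordinary softmax attention over the real tokens, which is the sense in which the sink acts as a router weight. A head whose gate stays near zero contributes little, matching the head collapse described in the abstract.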
