Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu; Zekun Wang; Bo Zheng; Zeyu Huang; Kaiyue Wen; Songlin Yang; Rui Men; Le Yu; Fei Huang; Suozhi Huang; Dayiheng Liu; Jingren Zhou; Junyang Lin

arXiv:2505.06708 · cs.CL · May 13, 2025

TL;DR

This paper systematically investigates gating mechanisms in softmax attention, demonstrating that a simple head-specific sigmoid gate improves model performance, stability, and scalability by introducing beneficial non-linearity and sparsity.

Contribution

The paper provides a systematic analysis of gating effects in softmax attention for large language models, showing how a simple gating modification improves performance and training stability.

Findings

01. Sigmoid gating improves performance across the 15B MoE and 1.7B dense models studied.

02. Gating enhances training stability and scalability.

03. Sparse gating mitigates attention sink and improves long-context extrapolation.

Abstract

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and softmax attention. Yet existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we compare over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we…
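
The central modification described in the abstract, a head-specific sigmoid gate applied after SDPA, can be illustrated with a minimal NumPy sketch. This is not the official implementation (the linked repository uses PyTorch). It assumes the gate is computed from the layer input through a learned projection (`w_gate` here), that the gate multiplies the SDPA output elementwise, that attention is causal, and that there are no bias terms. The function and parameter names are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention(x, w_q, w_k, w_v, w_gate, w_o, n_heads):
    """Causal multi-head attention with a sigmoid gate after SDPA.

    x has shape (seq, d_model); every weight has shape (d_model, d_model).
    ASSUMPTION: the gate is a sigmoid of a linear projection of x, split
    per head so that each head is gated by its own slice of w_gate.
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention with a causal mask.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    sdpa_out = softmax(scores) @ v  # (n_heads, seq, d_head)

    # Head-specific sigmoid gate, applied after SDPA.
    gate = sigmoid(split(x @ w_gate))  # values in (0, 1)
    gated = sdpa_out * gate

    # Merge heads and apply the output projection.
    merged = gated.transpose(1, 0, 2).reshape(seq, d_model)
    return merged @ w_o, gate


rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v, w_gate, w_o = (
    rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(5)
)
out, gate = gated_attention(x, w_q, w_k, w_v, w_gate, w_o, n_heads)
print(out.shape, gate.shape)
```

Because the gate depends on the current token's input rather than on the attention weights, it adds a non-linearity between the value and output projections and can scale a head's contribution toward zero for a given token. In this sketch, that is how the non-linearity and sparsity named in the title would arise.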

Code & Models

Repositories

qiuzh20/gated_attention (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Big Data and Digital Economy

Methods: Attention Is All You Need · Highway networks · Softmax