Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models

Alfred Shen; Aaron Shen

arXiv:2601.15305·cs.AI·January 23, 2026

Open Access

TL;DR

Gated Sparse Attention (GSA) combines sparse and gated attention mechanisms to improve efficiency, training stability, and model quality in long-context language models, with theoretical guarantees and empirical validation.

Contribution

The paper introduces GSA, an architecture that integrates gating with sparse attention, supported by theoretical analysis (complexity, expressiveness, and convergence) and by reported improvements in efficiency and model quality.

Findings

01. Achieves a 12-16x speedup at 128K context length

02. Reduces attention to the first token from 47% to under 4%

03. Improves perplexity from 6.03 to 5.70 and stabilizes training

Abstract

The computational burden of attention in long-context language models has motivated two largely independent lines of work: sparse attention mechanisms that reduce complexity by attending to selected tokens, and gated attention variants that improve training stability while mitigating the attention sink phenomenon. We observe that these approaches address complementary weaknesses and propose Gated Sparse Attention (GSA), an architecture that realizes the benefits of both. GSA incorporates a gated lightning indexer with sigmoid activations that produce bounded, interpretable selection scores, an adaptive sparsity controller that modulates the number of attended tokens based on local uncertainty, and dual gating at the value and output stages. We establish theoretical foundations for the approach, including complexity analysis, expressiveness results, and convergence guarantees. In…
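The three components named in the abstract (sigmoid-scored indexer, adaptive sparsity controller, value and output gating) can be illustrated with a small single-query, single-head NumPy sketch. This is not the authors' implementation: the abstract does not give the formulas, so the entropy-based rule for choosing k, the weight names (`W_iq`, `W_ik`, `W_gv`, `W_go`, and so on), the `k_min`/`k_max` bounds, and the function `gsa_single_query` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gsa_single_query(x, params, k_min=4, k_max=16):
    """Illustrative sketch: the last position of x (T, d) attends to all T positions."""
    T, d = x.shape
    x_t = x[-1]

    # 1) Indexer: a cheap low-dimensional score per key, squashed by a sigmoid
    #    so every selection score is bounded in (0, 1).
    iq = x_t @ params["W_iq"]                      # (d_idx,)
    ik = x @ params["W_ik"]                        # (T, d_idx)
    scores = sigmoid(ik @ iq / np.sqrt(iq.shape[0]))

    # 2) Adaptive sparsity (ASSUMED rule): flatter score distributions are
    #    treated as more uncertain, so more tokens are kept.
    p = scores / scores.sum()
    uncertainty = -(p * np.log(p)).sum() / np.log(T)   # normalized entropy
    k = int(np.clip(round(k_min + uncertainty * (k_max - k_min)), k_min, k_max))
    k = min(k, T)
    sel = np.argsort(scores)[-k:]                  # indices of the top-k keys

    # 3) Softmax attention restricted to the selected tokens.
    q = x_t @ params["W_q"]
    K = x[sel] @ params["W_k"]
    V = x[sel] @ params["W_v"]
    V = V * sigmoid(x[sel] @ params["W_gv"])       # value-stage gate
    logits = K @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    out = w @ V
    out = out * sigmoid(x_t @ params["W_go"])      # output-stage gate
    return out, sel, scores

# Toy usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
T, d, d_idx = 64, 32, 8
params = {name: rng.normal(size=shape) / np.sqrt(d) for name, shape in {
    "W_iq": (d, d_idx), "W_ik": (d, d_idx),
    "W_q": (d, d), "W_k": (d, d), "W_v": (d, d),
    "W_gv": (d, d), "W_go": (d, d),
}.items()}
x = rng.normal(size=(T, d))
out, sel, scores = gsa_single_query(x, params)
```

In this sketch the softmax runs over at most `k_max` selected keys rather than all T, which is where a sparse scheme of this kind would save work at long context lengths. The indexer itself still scores every key, but in a much smaller dimension (`d_idx`).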

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning