Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models
Alfred Shen, Aaron Shen

TL;DR
Gated Sparse Attention (GSA) combines sparse and gated attention mechanisms to improve efficiency, training stability, and model quality in long-context language models, with theoretical guarantees and empirical validation.
Contribution
The paper introduces GSA, an architecture that integrates gating with sparse attention, supports it with theoretical analysis (complexity, expressiveness, and convergence), and reports improvements in both efficiency and model quality.
Findings
Achieves 12-16x speedup at 128K context length
Reduces attention to the first token from 47% to under 4%
Improves perplexity from 6.03 to 5.70 and stabilizes training
Abstract
The computational burden of attention in long-context language models has motivated two largely independent lines of work: sparse attention mechanisms that reduce complexity by attending to selected tokens, and gated attention variants that improve training stability while mitigating the attention sink phenomenon. We observe that these approaches address complementary weaknesses and propose Gated Sparse Attention (GSA), an architecture that realizes the benefits of both. GSA incorporates a gated lightning indexer with sigmoid activations that produce bounded, interpretable selection scores, an adaptive sparsity controller that modulates the number of attended tokens based on local uncertainty, and dual gating at the value and output stages. We establish theoretical foundations for the approach, including complexity analysis, expressiveness results, and convergence guarantees. In…
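The three components named in the abstract can be illustrated with a rough single-query sketch in plain Python. This is not the authors' implementation: the summary gives no formulas, so the entropy-based rule in `adaptive_k`, the scalar (rather than elementwise) gates, and every function and parameter name below are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def indexer_scores(idx_q, idx_keys):
    # Sigmoid keeps every selection score in (0, 1), independently per token.
    return [sigmoid(dot(idx_q, k)) for k in idx_keys]

def adaptive_k(scores, k_min, k_max):
    # ASSUMPTION: "local uncertainty" is modeled as the normalized entropy of
    # the indexer scores; flatter scores lead to more attended tokens.
    n = len(scores)
    if n <= 1:
        return n
    total = sum(scores)
    probs = [s / total for s in scores]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    frac = entropy / math.log(n)
    k = k_min + round(frac * (k_max - k_min))
    return min(n, max(k_min, min(k, k_max)))

def gated_sparse_attention(q, keys, values, idx_q, idx_keys,
                           w_value_gate, w_out_gate, k_min=1, k_max=4):
    # 1. Score all tokens with the cheap indexer, then keep the top k.
    scores = indexer_scores(idx_q, idx_keys)
    k = adaptive_k(scores, k_min, k_max)
    order = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    selected = order[:k]

    # 2. Softmax attention restricted to the selected tokens.
    scale = math.sqrt(len(q))
    logits = [dot(q, keys[j]) / scale for j in selected]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    # 3. Value gate (ASSUMPTION: one scalar gate per selected token).
    gated = [[sigmoid(dot(w_value_gate, values[j])) * x for x in values[j]]
             for j in selected]
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, gated)) for i in range(dim)]

    # 4. Output gate (ASSUMPTION: one scalar gate computed from the query).
    g_out = sigmoid(dot(w_out_gate, q))
    return [g_out * x for x in mixed], selected, scores
```

Because both gates lie in (0, 1), a token can be selected and still contribute almost nothing to the output, which is one plausible reading of how gating reduces the weight placed on the first token. The per-query cost of step 2 scales with k rather than with the full context length, although the indexer in step 1 still touches every token.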
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
