SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Zhenyi Shen; Junru Lu; Lin Gui; Jiazheng Li; Yulan He; Di Yin; Xing Sun

arXiv:2511.20102 · cs.CL · February 2, 2026

Open Access · 4 Models

TL;DR

SSA introduces a training framework that aligns sparse and full attention outputs, enabling models to maintain high performance and long-context capabilities while reducing computational complexity.

Contribution

The paper proposes a bidirectional alignment method between sparse and full attention outputs, improving performance and adaptability across different sparsity levels.

Findings

01. Achieves state-of-the-art results in both sparse and full attention inference modes.

02. Reduces both the attention gap and the capability gap.

03. Demonstrates superior long-context processing capabilities.

Abstract

Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference distribution mismatch, and (2) a capability gap, where models trained purely with sparse attention lack complete gradient flow, preventing them from matching full-attention performance. We propose SSA (Sparse Sparse Attention), a training framework that integrates both sparse and full attention with bidirectional attention-output alignment. We prove that the approximation error scales linearly with the attention mass dropped under sparse attention, and show that SSA's alignment objective substantially reduces this quantity compared to baselines. Experiments demonstrate that SSA achieves state-of-the-art performance under both inference modes, adapts…
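
The abstract's claim that approximation error scales linearly with the dropped attention mass can be checked numerically on a toy example. The sketch below is not the paper's code or its exact theorem. It assumes top-k key selection with renormalization as the sparse pattern, and it uses one standard bound of this shape, error ≤ 2 · dropped_mass · max‖v‖, which follows from the triangle inequality. All names (`sparse_vs_full`, `max_value_norm`, and so on) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, values):
    """Attention output: sum_i weights[i] * values[i]."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def sparse_vs_full(scores, values, k):
    """Return (dropped_mass, error) for top-k sparse attention on one query.

    Assumption: the sparse pattern keeps the k highest-scoring keys and
    renormalizes their probabilities, which equals a softmax over kept keys.
    """
    probs = softmax(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    # Attention mass that full attention assigns to the keys we drop.
    dropped_mass = sum(p for i, p in enumerate(probs) if i not in keep)
    kept_mass = 1.0 - dropped_mass
    sparse_probs = [p / kept_mass if i in keep else 0.0
                    for i, p in enumerate(probs)]
    full_out = weighted_sum(probs, values)
    sparse_out = weighted_sum(sparse_probs, values)
    error = norm([a - b for a, b in zip(full_out, sparse_out)])
    return dropped_mass, error


random.seed(0)
n, dim = 64, 8
scores = [random.gauss(0.0, 2.0) for _ in range(n)]
values = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
max_value_norm = max(norm(v) for v in values)

for k in (4, 16, 64):
    dropped, err = sparse_vs_full(scores, values, k)
    bound = 2.0 * dropped * max_value_norm  # linear in the dropped mass
    print(f"k={k:2d} dropped_mass={dropped:.4f} error={err:.4f} bound={bound:.4f}")
```

With k equal to the number of keys, nothing is dropped and the two outputs coincide. Under this reading, a training objective that concentrates attention mass on the selected keys shrinks the dropped mass and therefore tightens the bound.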

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications