Rectified Sparse Attention
Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, and Furu Wei

TL;DR
Rectified Sparse Attention (ReSA) speeds up long-sequence generation in large language models by combining block-sparse attention with periodic dense rectification of the KV cache, reaching up to 2.42× end-to-end speedup at 256K sequence length while keeping generation quality near-lossless.
Contribution
ReSA bounds error accumulation in sparse decoding by refreshing the KV cache with a dense forward pass at fixed intervals, improving both the quality and the efficiency of long-sequence generation.
Findings
Achieves up to 2.42× end-to-end speedup under decoding at 256K sequence length.
Maintains near-lossless generation quality across tasks.
Demonstrates effectiveness in math reasoning, language modeling, and retrieval.
Abstract
Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42× end-to-end speedup under decoding at 256K sequence length, making it a practical solution for…
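The decoding pattern described in the abstract (sparse decoding steps interleaved with a dense refresh at fixed intervals) can be illustrated with a toy schedule. This is a minimal sketch, not the authors' implementation: the function name `resa_schedule` and the parameter `rectify_every` are hypothetical, and the sketch assumes one dense rectification after every `rectify_every` generated tokens. It models only when each action happens, not the attention computation or how sparse blocks are selected.

```python
def resa_schedule(num_new_tokens, rectify_every):
    """Toy schedule for a ReSA-style decoding loop (illustrative only).

    Assumption: a dense rectification pass runs after every `rectify_every`
    sparsely decoded tokens. Returns the list of (step, action) events and
    the largest number of cache entries that were ever awaiting a refresh.
    """
    if rectify_every < 1:
        raise ValueError("rectify_every must be at least 1")

    events = []
    stale = 0      # cache entries written by sparse decoding since the last refresh
    max_stale = 0  # worst-case count of un-rectified entries

    for step in range(1, num_new_tokens + 1):
        # Sparse step: the new token's cache entry is only approximate.
        events.append((step, "sparse_decode"))
        stale += 1
        max_stale = max(max_stale, stale)

        # Fixed-interval refresh: a dense pass recomputes the stale entries.
        if stale == rectify_every:
            events.append((step, "dense_rectify"))
            stale = 0

    return events, max_stale


events, max_stale = resa_schedule(num_new_tokens=10, rectify_every=4)
rectify_steps = [step for step, action in events if action == "dense_rectify"]
print(rectify_steps, max_stale)
```

In this toy model the number of un-rectified cache entries never exceeds `rectify_every`, which mirrors the abstract's claim that periodic refreshes bound error accumulation instead of letting it grow with the generated length.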
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
Methods: Softmax · Attention Is All You Need
