Rectified Sparse Attention

Yutao Sun; Tianzhu Ye; Li Dong; Yuqing Xia; Jian Chen; Yizhao Gao; Shijie Cao; Jianyong Wang; and Furu Wei

arXiv:2506.04108·cs.CL·June 6, 2025

Open Access

TL;DR

Rectified Sparse Attention (ReSA) enhances long-sequence generation efficiency in large language models by combining sparse attention with periodic dense cache rectification, significantly improving speed while maintaining quality.

Contribution

ReSA bounds error accumulation in sparse attention by periodically refreshing the KV cache with a dense forward pass, improving both the quality and the efficiency of long-sequence generation.

Findings

01 Achieves up to 2.42× end-to-end decoding speedup at 256K sequence length.

02 Maintains near-lossless generation quality across tasks.

03 Demonstrates effectiveness in math reasoning, language modeling, and retrieval.

Abstract

Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42× end-to-end speedup under decoding at 256K sequence length, making it a practical solution for…
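The control flow the abstract describes (sparse decoding steps interleaved with a dense refresh at fixed intervals) can be sketched as a toy schedule. This is only an illustration under stated assumptions, not the authors' implementation: the function name `resa_decode_schedule`, the parameter `rectify_every`, and the step labels are hypothetical, and the attention computations themselves are left out. The sketch shows just one property: the number of cache entries written by sparse steps and not yet refreshed never exceeds the interval.

```python
from typing import List, Tuple


def resa_decode_schedule(num_new_tokens: int, rectify_every: int) -> Tuple[List[str], int]:
    """Toy schedule: sparse decode steps with a dense refresh every `rectify_every` tokens.

    Returns the list of step kinds and the largest number of cache entries
    that were ever waiting for rectification. All names are hypothetical.
    """
    if rectify_every < 1:
        raise ValueError("rectify_every must be >= 1")

    steps: List[str] = []
    pending = 0      # KV entries written by sparse steps and not yet refreshed
    max_pending = 0

    for _ in range(num_new_tokens):
        # Assumption: one token is decoded with block-sparse attention here.
        steps.append("sparse")
        pending += 1
        max_pending = max(max_pending, pending)

        if pending == rectify_every:
            # Assumption: one dense forward pass over the pending tokens
            # rewrites their KV cache entries, resetting accumulated error.
            steps.append("dense_rectify")
            pending = 0

    return steps, max_pending


steps, max_pending = resa_decode_schedule(num_new_tokens=10, rectify_every=4)
print(steps.count("sparse"), steps.count("dense_rectify"), max_pending)  # prints "10 2 4"
```

In this toy model a smaller `rectify_every` keeps fewer unrefreshed entries in the cache at the cost of more dense passes; the source does not state which interval the authors use.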

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms

Methods: Softmax · Attention Is All You Need