Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

Jeffrey Willette; Heejun Lee; Sung Ju Hwang

arXiv:2505.11254·cs.LG·November 25, 2025

Open Access

TL;DR

This paper introduces Delta Attention, a correction method that improves the accuracy of sparse attention inference in transformers, reducing the performance degradation of sparse methods while preserving most of their speed advantage on long sequences.

Contribution

The paper proposes a novel distributional shift correction technique that improves sparse attention accuracy without sacrificing efficiency and can be applied to any sparse attention method.

Findings

01. Achieves a 36% performance increase over baseline sparse attention.

02. Recovers 88% of quadratic attention accuracy on the RULER benchmark.

03. Maintains 98.5% sparsity, enabling 32x faster inference than Flash Attention 2.

Abstract

The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention.…
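
The shift described in the abstract can be illustrated with a small NumPy sketch. It compares dense causal attention against a sliding-window mask with sink tokens, then applies one possible correction. The abstract as quoted here is truncated and does not spell out the procedure, so the strided correction below is an assumption made for illustration: dense attention is computed for every `stride`-th query, and the difference from the sparse output (the "delta") is reused for the queries that follow it. The names `window_sink_mask`, `delta_corrected`, `window`, `sinks`, and `stride` are hypothetical and do not come from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries arrive as -inf and become 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True where query i may attend to key j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def window_sink_mask(n, window, sinks):
    # Assumed sparse pattern: a local causal window plus the first `sinks` keys.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (((i - j) < window) | (j < sinks))

def attention(q, k, v, mask):
    # Single-head scaled dot-product attention restricted to `mask`.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def delta_corrected(q, k, v, sparse_mask, stride):
    # Assumed correction: dense attention for every `stride`-th query only,
    # then the (dense - sparse) difference is added to the sparse outputs
    # of that query and the stride - 1 queries after it.
    n = q.shape[0]
    sparse_out = attention(q, k, v, sparse_mask)
    anchors = np.arange(0, n, stride)
    dense_rows = attention(q[anchors], k, v, causal_mask(n)[anchors])
    delta = dense_rows - sparse_out[anchors]
    return sparse_out + np.repeat(delta, stride, axis=0)[:n]

rng = np.random.default_rng(0)
n, d = 256, 32
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

dense = attention(q, k, v, causal_mask(n))
sparse_mask = window_sink_mask(n, window=16, sinks=4)
sparse = attention(q, k, v, sparse_mask)
corrected = delta_corrected(q, k, v, sparse_mask, stride=8)

print("mean |sparse - dense|   :", np.abs(sparse - dense).mean())
print("mean |corrected - dense|:", np.abs(corrected - dense).mean())
```

With `stride=1` the correction reproduces dense attention exactly, at full quadratic cost. Larger strides compute fewer dense rows and rely on neighboring queries having similar deltas. Random inputs only show the mechanics; they say nothing about the accuracy figures reported above.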

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Data Storage Technologies · Big Data and Digital Economy

Methods: Softmax · Attention Is All You Need · ALIGN