SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling

Xiaodong Ji; Hailin Zhang; Fangcheng Fu; Bin Cui

arXiv:2505.24179·cs.LG·June 2, 2025

TL;DR

SALE is a fine-grained sparse attention method that estimates attention weights with 4-bit quantized query-key products and then applies block-sparse attention, accelerating long-context LLM prefilling by more than 3x with minimal accuracy loss.

Contribution

The paper presents SALE, a novel sparse attention technique that combines 4-bit quantized query-key products with block-sparse attention, enabling efficient long-context processing without retraining.

Findings

01

Achieves a 3.36x speedup on Llama-3.1-8B for sequences longer than 64K tokens.

02

Maintains model accuracy comparable to that of existing methods.

03

Requires no parameter training or extensive modifications.

Abstract

Many advanced Large Language Model (LLM) applications require long-context processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform coarse-grained inspection of the attention map, causing considerable loss in model accuracy. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLMs with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention weight estimation through 4-bit quantized query-key products, followed by block-sparse attention to accelerate prefilling computations. For importance evaluation…
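The two-stage idea in the abstract (a cheap low-bit estimate of query-key products, then attention restricted to selected blocks) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration: the symmetric per-vector quantizer, the names `quantize_int4`, `approx_logit`, and `select_blocks`, the block size, and the `margin` selection rule. The abstract is truncated before it describes the paper's actual importance criterion, so this is not the authors' method or kernel; the usual $1/\sqrt{d}$ logit scaling is also omitted.

```python
import random

def quantize_int4(vec):
    """Symmetric per-vector 4-bit quantization (assumed scheme).
    Returns integers in [-7, 7] and the float scale that maps them back."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    ints = [max(-7, min(7, round(x / scale))) for x in vec]
    return ints, scale

def approx_logit(q_int, q_scale, k_int, k_scale):
    """Integer dot product, rescaled to approximate the float q.k value."""
    return sum(a * b for a, b in zip(q_int, k_int)) * q_scale * k_scale

def select_blocks(Q, K, block, margin):
    """Pick causal (query_block, key_block) pairs from low-bit estimates.
    Hypothetical rule: keep a key block if its largest estimated logit is
    within `margin` of the largest estimate in the same query-block row."""
    q_quant = [quantize_int4(q) for q in Q]
    k_quant = [quantize_int4(k) for k in K]
    n_blocks = len(Q) // block
    kept = set()
    for qb in range(n_blocks):
        estimates = {}
        for kb in range(qb + 1):  # causal: skip key blocks in the future
            best = float("-inf")
            for i in range(qb * block, (qb + 1) * block):
                for j in range(kb * block, (kb + 1) * block):
                    if j <= i:  # causal mask inside the diagonal block
                        qi, qs = q_quant[i]
                        kj, ks = k_quant[j]
                        best = max(best, approx_logit(qi, qs, kj, ks))
            estimates[kb] = best
        row_max = max(estimates.values())
        kept |= {(qb, kb) for kb, s in estimates.items() if s >= row_max - margin}
    return kept

# Toy usage with random data: 16 tokens, head dimension 8, blocks of 4 tokens.
random.seed(0)
Q = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
K = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
kept_blocks = select_blocks(Q, K, block=4, margin=1.0)
print(sorted(kept_blocks))
```

Only the blocks in `kept_blocks` would then be passed to a full-precision block-sparse attention pass; the estimate itself never needs the exact attention weights.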

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is well-written, and the proposed method is explained clearly. The motivation for the Relative Attention Score is logical, and the kernel design is well-described.
2. The accuracy-efficiency trade-off plots (Figure 4) are the most important result and clearly demonstrate that SALE is Pareto-optimal, achieving higher accuracy at the same latency or lower latency at the same accuracy than all four baselines.
3. The idea of using a very fast, low-bit (4-bit) estimation pass to genera

Weaknesses

See Question below.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This paper achieves more accurate importance estimation through per-head threshold calibration.
2. Considerable work has gone into kernel optimization, and the paper explains the optimization ideas and their details thoroughly.

Weaknesses

1. Using a relative (max-subtracted) formulation for softmax stability is common practice. It is not particularly innovative and could be explained more clearly in the background or another section.
2. The paper does not demonstrate that the relative importance approximation is necessary; a simpler selection method might achieve the same effect.
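The common practice the reviewer refers to can be shown in a few lines: subtracting the row maximum before exponentiating leaves the softmax output unchanged while keeping every exponent at or below zero. The sketch below is a generic illustration, not the paper's Relative Attention Score, whose definition does not appear in this excerpt; the function names are made up for the example.

```python
import math

def relative_weights(logits):
    """exp(x - max): each weight is expressed relative to the largest logit,
    so the largest entry is exactly 1.0 and nothing can overflow."""
    m = max(logits)
    return [math.exp(x - m) for x in logits]

def softmax_stable(logits):
    """Softmax computed from the relative weights."""
    weights = relative_weights(logits)
    total = sum(weights)
    return [w / total for w in weights]

# math.exp(1000.0) alone would raise OverflowError; the relative form does not.
probs = softmax_stable([1000.0, 999.0, 990.0])
print(probs)
```

Because the relative weights already lie in (0, 1], a threshold on them can be compared across rows without first computing the full normalizer, which is one reason max-relative scores are convenient for selection.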

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. This paper uses fine-grained attention weight approximation via low-bit quantized $Q$ and $K$, instead of the pooling-based method adopted by many previous works. This allows more precise identification of which attention blocks are truly important.
2. The selection pass is efficiently organized. The method avoids materializing the full $N \times N$ attention matrices, which has the potential to achieve low prediction overhead.
3. The authors implement custom CUDA ke
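To give a sense of scale for the second point, the following sketch compares the number of entries in a full $N \times N$ score matrix with the number of entries in a block-level mask. The block size of 64 and the reading of "64K" as 65,536 tokens are assumptions chosen for the arithmetic, not values taken from the paper, and `mask_sizes` is a hypothetical helper.

```python
def mask_sizes(seq_len, block):
    """Count entries in a full score matrix versus a block-level mask."""
    n_blocks = -(-seq_len // block)  # ceiling division
    return {
        "full_entries": seq_len * seq_len,
        "block_entries": n_blocks * n_blocks,
        # A causal mask only needs the lower triangle, diagonal included.
        "causal_block_entries": n_blocks * (n_blocks + 1) // 2,
    }

sizes = mask_sizes(seq_len=65536, block=64)
print(sizes)
```

Under these assumptions the block mask holds about a million entries per head, against roughly 4.3 billion for the full matrix, so a selection pass that reduces each block to a single keep-or-skip decision on the fly has far less state to store.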

Weaknesses

1. **Limited accuracy improvement over baselines.** Although the proposed method achieves higher efficiency, its accuracy improvements are only marginal. On many benchmarks, the method performs comparably to or only slightly better than existing sparse attention baselines, and in several cases it does not achieve the best accuracy.
2. **Incomplete efficiency evaluation.** The efficiency analysis focuses primarily on the attention operation itself, rather than on end-to-end inference latency. Sin
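The gap between attention-only and end-to-end speedup raised in the second weakness follows from Amdahl's law: if attention accounts for a fraction $f$ of prefill time and is accelerated by a factor $s$, the overall speedup is $1 / ((1 - f) + f / s)$. The sketch below applies that formula; the attention fractions in the loop are hypothetical placeholders, not measurements from the paper or the review, and only the 3.36x attention speedup comes from the findings above.

```python
def end_to_end_speedup(attn_fraction, attn_speedup):
    """Amdahl's law: overall speedup when only the attention share is faster."""
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must be in [0, 1]")
    if attn_speedup <= 0:
        raise ValueError("attn_speedup must be positive")
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

# Hypothetical attention shares of total prefill time, for illustration only.
for fraction in (0.3, 0.5, 0.8):
    print(fraction, round(end_to_end_speedup(fraction, 3.36), 2))
```

The end-to-end figure approaches the attention-only figure only as the attention share approaches one, which is why reporting both numbers matters.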

Code & Models

Repositories

birdchristopher/sale
PyTorch · Official

Taxonomy

Topics: Brain Tumor Detection and Classification · Advanced Image Processing Techniques · Advanced Neural Network Applications

Methods: Softmax · Attention Is All You Need · ADaptive gradient method with the OPTimal convergence rate