AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

Di Liu; Ruitian Wang; Chen Chen; Mingliang Gong; Yongjie Yuan; Han Zhao; Yu Feng; Quan Chen; Minyi Guo

arXiv:2605.12110 · cs.DC · May 13, 2026

TL;DR

AB-Sparse is a training-free framework that adaptively allocates block sizes in block sparse attention, improving accuracy for long-context language models while preserving throughput.

Contribution

AB-Sparse introduces adaptive block size allocation and lossless quantization, addressing the accuracy loss that uniform block sizes cause in previous sparse attention methods.

Findings

01. Achieves up to 5.43% accuracy improvement over baselines.
02. Maintains throughput while enhancing accuracy.
03. Supports efficient execution with custom GPU kernels.

Abstract

As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has shown that attention computation is dominated by a small subset of tokens. This motivates block sparse attention methods that partition the KV cache into fixed-size blocks and selectively compute attention over those blocks exhibiting high importance. However, these methods assign a uniform block size across all attention heads, implicitly assuming homogeneous behavior throughout the model. Our analysis reveals that this assumption is flawed: attention heads exhibit widely varying sensitivity to block granularity, and uniformity leads to suboptimal accuracy. We present AB-Sparse, a training-free algorithm-system co-designed framework that improves accuracy while preserving throughput. AB-Sparse introduces lightweight adaptive block…
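
To make the block sparse attention idea in the abstract concrete, here is a minimal pure-Python sketch for a single query vector. It is not the paper's algorithm or its GPU kernels. The function name `block_sparse_attention`, the `top_k` parameter, the mean-key importance score, and the per-head block sizes in the demo are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend only to the top_k highest-scoring KV blocks for one query."""
    n, dim = len(keys), len(q)
    # Partition token indices into contiguous blocks of block_size.
    blocks = [list(range(s, min(s + block_size, n)))
              for s in range(0, n, block_size)]

    # Assumed importance proxy: the query dotted with the block's mean key.
    def block_score(idx):
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(dim)]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    token_ids = [i for b in selected for i in blocks[b]]

    # Scaled softmax attention restricted to tokens in the selected blocks.
    logits = [dot(q, keys[i]) / math.sqrt(dim) for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return out, selected


# Toy KV cache: 8 tokens, 2-dimensional keys, 1-dimensional values.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
        [-1.0, 0.0], [-0.9, -0.1], [0.0, -1.0], [0.1, -0.9]]
values = [[float(i)] for i in range(8)]
q = [1.0, 0.0]

# Hypothetical per-head block sizes; a uniform scheme would use one value.
for head, block_size in {0: 2, 1: 4}.items():
    out, selected = block_sparse_attention(q, keys, values, block_size, top_k=1)
    print(head, block_size, selected, out)
```

With the same budget of one block, the head using block size 2 reads two tokens while the head using block size 4 reads four. The sketch shows only that granularity trade-off. How AB-Sparse actually chooses a block size for each head is not specified in the truncated abstract.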
