AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
Di Liu, Ruitian Wang, Chen Chen, Mingliang Gong, Yongjie Yuan, Han Zhao, Yu Feng, Quan Chen, Minyi Guo

TL;DR
AB-Sparse is a training-free framework that adapts the block size used in block sparse attention to each attention head, improving accuracy for long-context language models while preserving throughput.
Contribution
AB-Sparse introduces adaptive block size allocation and lossless quantization to address the limitation of the uniform block size used by earlier block sparse attention methods.
Findings
Achieves up to 5.43% accuracy improvement over baselines.
Maintains throughput while enhancing accuracy.
Supports efficient execution with custom GPU kernels.
Abstract
As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has shown that attention computation is dominated by a small subset of tokens. This motivates block sparse attention methods that partition the KV cache into fixed-size blocks and selectively compute attention over those blocks exhibiting high importance. However, these methods assign a uniform block size across all attention heads, implicitly assuming homogeneous behavior throughout the model. Our analysis reveals that this assumption is flawed: attention heads exhibit widely varying sensitivity to block granularity, and uniformity leads to suboptimal accuracy. We present AB-Sparse, a training-free algorithm-system co-designed framework that improves accuracy while preserving throughput. AB-Sparse introduces lightweight adaptive block…
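The block sparse attention idea the abstract builds on can be sketched for a single query and a single head. This is a minimal illustration, not the AB-Sparse implementation: the function name `block_sparse_attention`, the `top_k` parameter, and the mean-key importance score are all assumptions made for the example, since the abstract does not say how block importance is estimated.

```python
import math


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend one query over only the top_k highest-scoring KV blocks.

    Hypothetical sketch: block importance is approximated by the dot
    product of q with each block's mean key (an assumption, not the
    paper's scoring rule).
    """
    n = len(keys)
    dim = len(q)

    # Partition token indices into contiguous fixed-size blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(q, mean_key)

    # Rank blocks by estimated importance and keep the best top_k.
    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    token_ids = [i for b in selected for i in blocks[b]]

    # Standard scaled dot-product softmax attention, restricted to the
    # tokens inside the selected blocks.
    scale = 1.0 / math.sqrt(dim)
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return out, selected


# Toy example: four tokens, two blocks of size 2.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
q = [0.0, 2.0]
out, selected = block_sparse_attention(q, keys, values, block_size=2, top_k=1)
```

In this sketch `block_size` is an ordinary argument, so each head could be called with its own value. That is the kind of non-uniform configuration the abstract argues for; how AB-Sparse actually chooses per-head sizes is not described in the text above.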
