SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Qianchao Zhu; Jiangfei Duan; Chang Chen; Siran Liu; Guanyu Feng; Xin Lv; Xiao Chuanfu; Dahua Lin; Chao Yang

arXiv:2406.15486 · cs.CL · September 4, 2025

TL;DR

SampleAttention is an adaptive structured sparse attention mechanism that reduces Time-to-First-Token for long-context LLM inference by up to 2.42x over FlashAttention with near-zero accuracy loss, by capturing local window and column stripe patterns at low overhead.

Contribution

The paper proposes a near-lossless sparse attention method that captures head-specific sparse patterns dynamically at runtime, without additional training.

Findings

01. Reduces Time-to-First-Token by up to 2.42x compared to FlashAttention.

02. Achieves near-zero accuracy loss when replacing vanilla attention.

03. Effectively captures local and column stripe patterns with low overhead.

Abstract

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in very long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find that dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively selects a minimum set of…
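The mask structure described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the parameters `window_ratio`, `sample_ratio`, and `coverage`, and the strided choice of sampled queries are all assumptions made for illustration. The sketch builds a causal local window covering a fixed fraction of the sequence, then picks column stripes by scoring a sample of query rows and keeping the smallest set of key columns that covers a target share of the sampled attention mass. It computes dense scores, so it shows which entries would be kept, not the speedup.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def structured_sparse_mask(q, k, window_ratio=0.1, sample_ratio=0.25, coverage=0.95):
    """Hypothetical helper: boolean (n, n) mask = causal local window + column stripes.

    All parameter names and defaults are illustrative assumptions.
    """
    n, d = q.shape
    rows = np.arange(n)[:, None]
    cols = np.arange(n)[None, :]
    causal = cols <= rows

    # Local window: each query attends to a fixed percentage of adjacent tokens.
    window = max(1, int(np.ceil(window_ratio * n)))
    local = causal & (rows - cols < window)

    # Stage 1 (assumed): score only a strided sample of query rows.
    stride = max(1, int(round(1 / sample_ratio)))
    sampled = np.arange(n - 1, -1, -stride)  # always includes the last query
    scores = q[sampled] @ k.T / np.sqrt(d)
    scores = np.where(cols <= sampled[:, None], scores, -np.inf)
    col_score = softmax(scores, axis=-1).sum(axis=0)

    # Stage 2 (assumed): keep the smallest set of key columns whose
    # accumulated sampled attention mass reaches `coverage`.
    order = np.argsort(col_score)[::-1]
    cum = np.cumsum(col_score[order]) / col_score.sum()
    keep = order[: int(np.searchsorted(cum, coverage)) + 1]
    stripes = np.zeros(n, dtype=bool)
    stripes[keep] = True

    return local | (causal & stripes[None, :])

def sparse_attention(q, k, v, mask):
    # Dense reference: attention restricted to the entries where mask is True.
    d = q.shape[-1]
    scores = np.where(mask, q @ k.T / np.sqrt(d), -np.inf)
    return softmax(scores, axis=-1) @ v
```

For a single head with `q`, `k`, `v` of shape `(n, d)`, calling `sparse_attention(q, k, v, structured_sparse_mask(q, k))` gives the masked output, and `mask.mean()` reports the fraction of score entries retained. Because the mask is computed per call from the actual queries and keys, each head would get its own pattern, which is the property the abstract calls head-specific.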

Taxonomy

Topics: Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training