Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

Kan Zhu; Tian Tang; Qinyu Xu; Yile Gu; Zhichen Zeng; Rohan Kadekodi; Liangyu Zhao; Ang Li; Arvind Krishnamurthy; Baris Kasikci

arXiv:2502.12216 · cs.LG · February 19, 2025

TL;DR

Tactic introduces an adaptive sparse attention mechanism for long-context LLMs that dynamically selects tokens based on attention importance, improving efficiency and accuracy over fixed-budget methods.

Contribution

Tactic proposes a calibration-free sparse attention method that uses clustering and distribution fitting to select tokens adaptively, based on their attention scores.

Findings

01 Achieves up to 7.29x speedup in decode attention

02 Outperforms existing sparse attention algorithms in accuracy

03 Provides a 1.58x overall inference speedup

Abstract

Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in the importance of attention across heads, layers, and contexts. To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance…
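The selection criterion described in the abstract (keep tokens until their cumulative attention reaches a target fraction of the total) can be illustrated with a small sketch. This is only the exact, "oracle" form of the criterion computed from full attention scores; it is not the paper's implementation, which approximates this selection with clustering-based sorting and distribution fitting. The function name `select_tokens_by_cumulative_attention` and the parameter `target_fraction` are hypothetical names chosen for this example.

```python
import numpy as np


def select_tokens_by_cumulative_attention(scores, target_fraction):
    """Return sorted indices of the fewest tokens whose attention scores
    sum to at least `target_fraction` of the total attention.

    Illustrative oracle only: it assumes the full attention scores for one
    query and one head are already known, which a real sparse-attention
    system would try to avoid computing.
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Visit tokens from highest to lowest attention score.
    order = np.argsort(-scores, kind="stable")
    cumulative = np.cumsum(scores[order])
    threshold = target_fraction * cumulative[-1]
    # First position at which the running sum reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    k = min(k, len(scores))
    return np.sort(order[:k])


if __name__ == "__main__":
    peaked_head = [0.05, 0.90, 0.03, 0.02]  # attention concentrated on one token
    flat_head = [0.25, 0.25, 0.25, 0.25]    # attention spread evenly
    print(select_tokens_by_cumulative_attention(peaked_head, 0.8))
    print(select_tokens_by_cumulative_attention(flat_head, 0.8))
```

With the same target fraction, the peaked head needs a single token while the flat head needs all four, which is the kind of variation across heads that a fixed token budget cannot capture.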

Taxonomy

Topics: Data Stream Mining Techniques · Data Mining Algorithms and Applications

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training