Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Chaofan Lin; Jiaming Tang; Shuo Yang; Hanshuo Wang; Tian Tang; Boyu Tian; Ion Stoica; Song Han; Mingyu Gao

arXiv:2502.02770 · cs.LG · November 5, 2025

TL;DR

Twilight is an adaptive attention sparsity framework for long-context large language models. It borrows top-$p$ (nucleus) sampling to set the token budget dynamically instead of fixing it in advance, pruning redundant tokens and accelerating attention without sacrificing accuracy.

Contribution

The paper presents Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm, improving efficiency while maintaining accuracy in long-context LLMs.

Findings

01. Prunes up to 98% of redundant tokens

02. Achieves $15.4 \times$ acceleration in self-attention operations

03. Attains a $3.9 \times$ reduction in per-token latency

Abstract

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget. This presents a significant challenge during deployment, because a fixed budget fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that applying top-$p$ sampling (nucleus sampling) to sparse attention can, surprisingly, achieve adaptive budgeting. Based on this, we propose Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm without sacrificing its accuracy. Empirical results show that Twilight can adaptively prune up to 98% of redundant tokens, leading to $15.4 \times$ acceleration in self-attention operations and…
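To make the contrast between a fixed budget and a top-$p$ budget concrete, here is a minimal sketch in plain Python. It is not the paper's hierarchical pipeline or its kernels; it only illustrates the selection rule on one query's attention scores. The function name `top_p_token_indices`, the threshold `p=0.9`, and the toy score vectors are all assumptions made for illustration.

```python
import math


def top_p_token_indices(scores, p=0.9):
    """Keep the smallest set of tokens whose softmax weights sum to at least p.

    `scores` holds one query's raw attention scores (hypothetical input).
    Returns the kept token indices in ascending order.
    """
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Visit tokens from heaviest to lightest and stop once mass p is covered.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return sorted(kept)


# A peaked distribution needs few tokens; a flat one needs many.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [0.0] * 8
print(len(top_p_token_indices(peaked)), len(top_p_token_indices(flat)))
```

With these toy inputs the peaked row keeps a single token while the flat row keeps all eight, so the budget follows the shape of the attention distribution rather than a preset top-$k$.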

Videos

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning · SlidesLive

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Advanced Bandit Algorithms Research

Methods: Softmax · Attention Is All You Need