A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention
Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai

TL;DR
This paper investigates the effectiveness and theoretical basis of native Top-$k$ Sparse Attention in large language models, demonstrating its potential to reduce computation while maintaining or improving performance in long-context tasks.
Contribution
It provides empirical validation and theoretical insights into Top-$k$ Attention, including training strategies, approximation effects, and entropy-based interpretations.
Findings
Exact Top-$k$ Decoding achieves performance comparable to, or better than, full attention on downstream tasks such as HELMET and LongBench v2.
Training with Top-$k$ attention enhances downstream task results.
Higher approximation fidelity correlates with better task performance.
Abstract
Large Language Models (LLMs) are increasingly prevalent in long-context modeling; however, their inference cost has become a critical bottleneck hindering progress on tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of Top-$k$ Attention during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experiments. These show that retaining only the pivotal Keys with the highest similarity to the Query as the context window during decoding achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm…
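The decoding step described in the abstract (keeping only the Keys most similar to the Query) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `topk_decode_attention`, the single-head, single-query setting, and the scaled dot-product score are all assumptions made for illustration.

```python
import numpy as np

def topk_decode_attention(q, K, V, k):
    """Single-query attention over only the k highest-scoring keys.

    Hypothetical helper for illustration; not taken from the paper.
    q: (d,) query, K: (n, d) keys, V: (n, d_v) values.
    Returns the attention output and the indices of the kept keys.
    """
    n, d = K.shape
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, n)
    # Assumed similarity: scaled dot product between the query and every key.
    scores = K @ q / np.sqrt(d)
    # Indices of the k largest scores (returned in no particular order).
    idx = np.argpartition(scores, n - k)[n - k:]
    selected = scores[idx]
    # Softmax restricted to the selected keys; all other keys get zero weight.
    weights = np.exp(selected - selected.max())
    weights /= weights.sum()
    return weights @ V[idx], idx
```

With `k` equal to the number of keys, the function reduces to ordinary full attention for that query; with `k = 1`, it returns the value of the single best-matching key. Note that this sketch still scores every key before selecting, so it illustrates the selection rule rather than any compute savings.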
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
