RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning
Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan

TL;DR
This paper introduces RaaS, a novel attention sparsity algorithm for large language models that efficiently identifies and retains key milestone tokens during reasoning tasks, balancing accuracy, time, and memory.
Contribution
RaaS leverages a new attention pattern to identify milestone tokens, enabling efficient reasoning with reduced complexity while maintaining high accuracy.
Findings
Achieves high accuracy with O(L) time and O(L) memory complexity, where L is the cache budget.
Effectively identifies and retains milestone tokens during reasoning.
Balances the accuracy-time-memory trade-off in LLM reasoning tasks.
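To make the O(L) claim concrete, here is a minimal Python sketch of a KV cache that never holds more than L entries, so both the attention work per decoded token and the memory stay bounded by the budget. The eviction rule shown (refresh a token's timestamp whenever its attention weight reaches a threshold, then evict the stalest entry) is an assumption made for illustration; the summary above does not spell out how RaaS identifies milestone tokens. The class name `BudgetedKVCache` and the `threshold` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class BudgetedKVCache:
    """Hypothetical KV cache holding at most `budget` (L) entries."""

    def __init__(self, budget, threshold):
        self.budget = budget        # L: maximum number of cached tokens
        self.threshold = threshold  # assumed cutoff for "still in use"
        self.entries = []           # each entry: key, value, last_used
        self.clock = 0              # counts decode steps

    def attend(self, query):
        """Scaled dot-product attention over cached tokens only: O(L) work."""
        if not self.entries:
            raise ValueError("cache is empty")
        self.clock += 1
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, entry["key"])) / scale
            for entry in self.entries
        ]
        weights = softmax(scores)
        # Assumed rule: a token that receives enough attention is still needed,
        # so its timestamp is refreshed.
        for entry, weight in zip(self.entries, weights):
            if weight >= self.threshold:
                entry["last_used"] = self.clock
        dim = len(self.entries[0]["value"])
        return [
            sum(w * entry["value"][d] for entry, w in zip(self.entries, weights))
            for d in range(dim)
        ]

    def append(self, key, value):
        """Add a token's KV pair, evicting the stalest entry when full."""
        if len(self.entries) >= self.budget:
            stalest = min(
                range(len(self.entries)),
                key=lambda i: self.entries[i]["last_used"],
            )
            self.entries.pop(stalest)
        self.entries.append(
            {"key": key, "value": value, "last_used": self.clock}
        )
```

In a decode loop under these assumptions, each step would call `attend` with the new query and then `append` the new token's key and value. The cache size never exceeds `budget`, in contrast with a full cache that grows with the sequence length N.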
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires an LLM to generate long sequences, incurring O(N) time and memory complexities per token, where N is the current sequence length. To reduce complexities, existing sparsity-based algorithms propose to retain Key-Value (KV) vectors, the intermediate representations, of only the most critical tokens. However, these algorithms struggle with the "impossible trinity" of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with O(L) time but O(N) memory (L is the cache budget, L ≪ N). To address the "impossible trinity", in this paper, we identify a new attention pattern during the decode stage of…
Taxonomy
Topics: Neural Networks and Applications · Advanced Data Compression Techniques · Image Retrieval and Classification Techniques
Methods: Softmax · Attention Is All You Need
