RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning

Junhao Hu; Wenrui Huang; Weidong Wang; Zhenwen Li; Tiancheng Hu; Zhixia Liu; Xusheng Chen; Tao Xie; Yizhou Shan

arXiv:2502.11147 · cs.LG · June 2, 2025

TL;DR

This paper introduces RaaS, a novel attention sparsity algorithm for large language models that efficiently identifies and retains key milestone tokens during reasoning tasks, balancing accuracy, time, and memory.

Contribution

RaaS leverages a new attention pattern to identify milestone tokens, enabling efficient reasoning with reduced complexity while maintaining high accuracy.

Findings

1. Achieves high accuracy with O(L) time and memory complexities, where L is the cache budget.

2. Effectively identifies and retains milestone tokens during reasoning.

3. Balances the accuracy-time-memory trade-off in LLM reasoning tasks.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires an LLM to generate long sequences, incurring $O(N)$ time and memory complexities per token, where $N$ is the current sequence length. To reduce these complexities, existing sparsity-based algorithms propose to retain the Key-Value (KV) vectors, that is, the intermediate representations, of only the most critical tokens. However, these algorithms struggle with the "impossible trinity" of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with $O(L)$ time but $O(N)$ memory ($L$ is the cache budget, $L \ll N$). To address the "impossible trinity", in this paper, we identify a new attention pattern during the decode stage of…
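To make the complexity claims concrete, the following is a minimal sketch of a budget-bounded KV cache, so that each decode step attends over at most L + 1 entries rather than all N. It is an illustration under stated assumptions, not the RaaS algorithm: the class name `BudgetedKVCache`, the `threshold` parameter, and the least-recently-attended eviction rule are all hypothetical choices made for this example, since the summary above does not spell out how milestone tokens are selected.

```python
import math


def attend(query, keys, values):
    """Single-head softmax attention of one query over the cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


class BudgetedKVCache:
    """Hypothetical cache holding at most `budget` (key, value) pairs."""

    def __init__(self, budget):
        self.budget = budget
        self.keys, self.values, self.last_used = [], [], []
        self.step = 0

    def decode_step(self, query, key, value, threshold=0.1):
        """Append the new token, attend, then evict if over budget."""
        self.step += 1
        self.keys.append(key)
        self.values.append(value)
        self.last_used.append(self.step)
        out, weights = attend(query, self.keys, self.values)
        # Assumed rule: an entry counts as "recently used" if its attention
        # weight reaches the threshold at this step.
        for i, w in enumerate(weights):
            if w >= threshold:
                self.last_used[i] = self.step
        # Assumed rule: drop the entry that was used least recently.
        if len(self.keys) > self.budget:
            victim = min(range(len(self.keys)), key=self.last_used.__getitem__)
            for buf in (self.keys, self.values, self.last_used):
                del buf[victim]
        return out
```

With this structure, both the attention work and the stored state per decode step are bounded by the budget instead of growing with the sequence length. Whether accuracy survives depends entirely on the eviction rule, which is the part the paper addresses.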

Taxonomy

Topics: Neural Networks and Applications · Advanced Data Compression Techniques · Image Retrieval and Classification Techniques

Methods: Softmax · Attention Is All You Need