FASA: Frequency-aware Sparse Attention
Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley

TL;DR
FASA introduces a dynamic, query-aware token pruning method based on frequency-chunk analysis, significantly reducing memory use in large language models while maintaining high accuracy.
Contribution
FASA leverages a novel frequency-chunk sparsity insight to identify salient tokens, enabling efficient and robust attention pruning in long-context language modeling.
Findings
Achieves near full-KV performance with only 256 tokens on LongBench-V1.
Provides 2.56× speedup with 18.9% cache usage on AIME24.
Outperforms all token-eviction baselines across various long-context tasks.
Abstract
The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full…
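The abstract is cut off before it describes the selection procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a RoPE attention logit decomposes into per-pair ("frequency chunk") contributions, that adjacent dimensions `(2i, 2i+1)` form one chunk, and that the set of dominant chunks is already known and passed in as `dominant_chunks`. The function names, the `budget` parameter, and the top-k selection rule are all hypothetical.

```python
import numpy as np

def chunk_scores(q, K):
    """Per-chunk logit contributions, shape (n_tokens, d // 2).

    Assumes q and K are already RoPE-rotated and that dims (2i, 2i+1)
    form frequency chunk i (an illustrative layout assumption).
    """
    n, d = K.shape
    return (K * q).reshape(n, d // 2, 2).sum(axis=-1)

def select_tokens(q, K, dominant_chunks, budget):
    """Rank tokens by the partial logit over the dominant chunks only."""
    partial = chunk_scores(q, K)[:, dominant_chunks].sum(axis=1)
    budget = min(budget, K.shape[0])
    top = np.argpartition(-partial, budget - 1)[:budget]
    return np.sort(top)  # keep the original token order

def sparse_attention(q, K, V, dominant_chunks, budget):
    """Full-dimension softmax attention restricted to the selected tokens."""
    keep = select_tokens(q, K, dominant_chunks, budget)
    logits = K[keep] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[keep], keep

# Toy usage: 16 cached tokens, head dimension 8 (4 chunks), keep 4 tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
out, keep = sparse_attention(q, K, V, dominant_chunks=[0, 2], budget=4)
```

In this sketch the ranking pass reads only the dominant chunks of each key, while the final attention uses the full keys and values of the retained tokens. Token indices are left unchanged, which is consistent with the reviewer remark that original positions are preserved.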
Peer Reviews
Decision: ICLR 2026 Poster
- novel idea (and observation) to use functional sparsity of RoPE
- speedup demonstrated at long context, and can work with other KV compression schemes as well
- robust across datasets
- because they do not re-index token positions, original absolute positions of tokens are preserved

- Not applicable to non-RoPE variants, further discussion on that would help.
- The idea is impressive, but it took quite some effort to understand. I think a huge amount of math can be simplified in the paper, maybe shifted to appendix for ‘more details’.

1. The discovery of "functional sparsity at the frequency block level" in RoPE provides a new, theoretically grounded perspective for understanding attention mechanisms and designing sparse attention models, rather than relying solely on heuristics.
2. The core hypotheses were sufficiently verified; in addition to verifying the sparsity of the dominant FCs, the universality and task-invariance of the CA index were also verified.
3. Strong practicality: The paper considers different hardware cons

1. Strong dependence on RoPE: The entire methodology and core findings are built upon the analysis of RoPE. This severely limits the method's generalizability. It remains unclear whether or how FASA can be generalized to models using other positional encodings or even those without explicit positional encoding.
2. Table 6 shows the robustness to the data. Only the TREC and MATH datasets are provided here, which is relatively limited in variety. Furthermore, the data size is not analyzed.

1. Their insights about RoPE are interesting.
2. According to their experiments, FASA appears to be effective.

- The authors mention that their insights about frequency chunks relate to RoPE and cite previous works. However, since this aspect is critical to FASA, I believe they should include their theoretical analysis directly in the paper. It is difficult for readers to be convinced by the current content alone.
- Regarding the experiments conducted to support their insights about sparsity in frequency chunks, I think more demonstration is needed. The figure in the main text simply compares two heatmap
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling
