Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff

TL;DR
This paper introduces a pre-scoring framework for efficient attention in transformers that prioritizes informative keys, improving long-context modeling accuracy and efficiency across language and vision tasks.
Contribution
The paper presents a pre-scoring method that enhances approximate attention by identifying structurally important keys, with demonstrated gains on long-context language modeling and preserved accuracy when applied to Vision Transformers.
Findings
Perplexity decreases from 12.0 to 9.5 on ChatGLM with 131k-token contexts under a fixed interaction budget.
Clustering-based scoring outperforms leverage-based methods under the same key budget.
The approach generalizes effectively to Vision Transformers, maintaining accuracy.
Abstract
Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves…
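The pre-scoring step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`leverage_scores`, `kmeans_scores`, `prescored_attention`) and the `budget` parameter are hypothetical, the clustering score (negative distance to the nearest k-means centroid) is only one plausible reading of "clustering-based scoring", and exact softmax attention over the selected keys stands in for HyperAttention.

```python
import numpy as np


def leverage_scores(K):
    """Statistical leverage score of each key (row of K).

    With a reduced QR factorization K = QR, the score of row i is ||Q[i]||^2.
    """
    Q, _ = np.linalg.qr(K)  # reduced mode: Q has shape (n_keys, d)
    return np.sum(Q ** 2, axis=1)


def kmeans_scores(K, n_clusters=8, n_iters=10, seed=0):
    """Hypothetical clustering score: keys closer to a centroid score higher."""
    rng = np.random.default_rng(seed)
    centroids = K[rng.choice(len(K), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        dists = ((K[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = K[assign == c]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[c] = members.mean(axis=0)
    dists = ((K[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return -dists.min(axis=1)


def prescored_attention(Q, K, V, budget, score_fn=leverage_scores):
    """Score keys once (query-independent), keep the top `budget`, then attend.

    Assumption: exact softmax over the kept keys replaces the approximate
    attention kernel used in the paper.
    """
    scores = score_fn(K)
    keep = np.argsort(scores)[::-1][:budget]  # indices of highest-scoring keys
    logits = Q @ K[keep].T / np.sqrt(K.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V[keep], keep
```

Because the scores depend only on the keys, they can be computed once per sequence and reused for every query; passing `score_fn=kmeans_scores` swaps the leverage-style prior for the clustering-style one under the same key budget.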
Peer Reviews
Decision: Submitted to ICLR 2026
The paper provides some theoretical analysis for the proposed method, but it is hard for me to understand what kind of guarantee it actually provides (see next section).
Overall, I have many concerns with the paper. First, I found the paper very hard to read. One of the reasons is that the authors assume that the reader is very familiar with the previous works LevAttention and HyperAttention. For example, many concepts are not introduced in the paper ("heavy attention scores" line 59, "statistical leverage scores" line 99, "polynomial based attention" line 100, "positional locality" line 103, "planted model" line 135, etc.). Similarly, the different theorems or a
1. Targeting the recall gap of HyperAttention by ranking keys beforehand is a clean, practical idea that directly addresses missed heavy scores. The algorithms are presented with simple wrappers over HyperAttention.
2. The planted-subspace analysis and Theorems 1–2 formalize when clustering isolates heavy keys, matching the empirical intuition that important keys align with near-orthogonal directions.
3. Results span LongBench perplexity on GLM2 and GLM3, speed comparisons vs FlashAttention, an
1. The strongest PPL ≈ 8.3 appears tied to the min_seq_len ≥ n_query configuration and sometimes even top-k set to zero, which partially credits an optimization switch rather than the proposed pre-scoring itself. The paper should isolate gains from pre-scoring vs implementation flags and report both.
2. Speedups are reported per layer against FlashAttention and discussed asymptotically, but it is unclear how these translate to whole-model throughput and latency under realistic batch sizes and se
+ Clear and practical idea: The paper provides a straightforward approach to enhance HyperAttention by pre-scoring and then attending. This directly addresses a known issue: HyperAttention’s hashing is not aware of which keys matter, and LevAttention’s “universal set” can get large. The bridge between them is simple and useful in practice.
+ Mix of theory and experiments: The paper offers proofs under a standard planted-subspace setup (to argue why the pre-scoring should work) and shows results
- Reason for PPL improvement: The best perplexity (~8.31) happens when pre-scoring is off (top-k = 0, sample_size = 0) and min_seq_len ≥ n_query is set. The paper itself says this gain comes from that configuration (forcing the faster block/tiled path), not from pre-scoring. A clean ablation is needed to separate the effects.
- Unclear speedup claims: > Compared to the original HyperAttention, these methods can generate a mild acceleration, with performance becoming more remarkable starting a
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Big Data and Digital Economy
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
