Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers

Zhexiang Li; Haoyu Wang; Yutong Bao; David Woodruff

arXiv:2505.11040 · cs.LG · February 10, 2026

TL;DR

This paper introduces a pre-scoring framework for efficient attention in transformers that prioritizes informative keys, improving long-context modeling accuracy and efficiency across language and vision tasks.

Contribution

The paper presents a novel pre-scoring method that enhances approximate attention by identifying structurally important keys, with demonstrated improvements in language and vision transformer performance.

Findings

01. On ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget.

02. Clustering-based scoring outperforms leverage-based selection under the same key budget.

03. The approach generalizes to Vision Transformers when it replaces self-attention.

Abstract

Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves…
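The pre-scoring idea described in the abstract (score keys once, independently of the queries, then attend only to the highest-scoring subset) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `leverage_scores` and `prescored_attention` and the `budget` parameter are hypothetical, the scoring rule is the standard statistical leverage score of each key row rather than the clustering-based variant, and exact softmax attention over the kept keys stands in for HyperAttention.

```python
import numpy as np

def leverage_scores(K):
    """Statistical leverage score of each row of K (n x d).

    Computed as the squared row norms of an orthonormal basis for the
    column space of K, taken from a thin SVD.
    """
    U, s, _ = np.linalg.svd(K, full_matrices=False)
    rank = int(np.sum(s > s.max() * 1e-10))  # drop numerically null directions
    return np.sum(U[:, :rank] ** 2, axis=1)

def prescored_attention(Q, K, V, budget):
    """Softmax attention restricted to the `budget` highest-scoring keys.

    Assumption: the query-independent prior is the leverage score, and
    exact attention over the kept keys replaces the approximate
    (HyperAttention) step used in the paper.
    """
    scores = leverage_scores(K)
    keep = np.sort(np.argsort(scores)[-budget:])  # indices of kept keys
    logits = Q @ K[keep].T / np.sqrt(K.shape[1])
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V[keep], keep

# Toy usage on random data (shapes are illustrative only).
rng = np.random.default_rng(0)
n, d = 256, 16
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out, keep = prescored_attention(Q, K, V, budget=32)
```

Because the scores depend only on the keys, they are computed once and shared by every query; a clustering-based score could be swapped in for `leverage_scores` without changing the rest of the sketch.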

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The paper provides some theoretical analysis for the proposed method, but it is hard for me to understand what kind of guarantee it actually provides (see next section).

Weaknesses

Overall, I have many concerns with the paper. First, I found the paper very hard to read. One reason is that the authors assume the reader is very familiar with the previous works LevAttention and HyperAttention. For example, many concepts are not introduced in the paper ("heavy attention scores" line 59, "statistical leverage scores" line 99, "polynomial based attention" line 100, "positional locality" line 103, "planted model" line 135, etc.). Similarly, the different theorems or a

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. Targeting the recall gap of HyperAttention by ranking keys beforehand is a clean, practical idea that directly addresses missed heavy scores. The algorithms are presented with simple wrappers over HyperAttention.

2. The planted-subspace analysis and Theorems 1–2 formalize when clustering isolates heavy keys, matching the empirical intuition that important keys align with near-orthogonal directions.

3. Results span LongBench perplexity on GLM2 and GLM3, speed comparisons vs FlashAttention, an

Weaknesses

1. The strongest PPL ≈ 8.3 appears tied to the min_seq_len ≥ n_query configuration and sometimes even top-k set to zero, which partially credits an optimization switch rather than the proposed pre-scoring itself. The paper should isolate gains from pre-scoring vs implementation flags and report both.

2. Speedups are reported per layer against FlashAttention and discussed asymptotically, but it is unclear how these translate to whole-model throughput and latency under realistic batch sizes and se

Reviewer 03 · Rating 2 · Confidence 3

Strengths

+ Clear and practical idea: The paper provides a straightforward approach to enhance HyperAttention by pre-scoring and then attending. This directly addresses a known issue: HyperAttention’s hashing is not aware of which keys matter, and LevAttention’s “universal set” can get large. The bridge between them is simple and useful in practice.

+ Mix of theory and experiments: The paper offers proofs under a standard planted-subspace setup (to argue why the pre-scoring should work) and shows results

Weaknesses

- Reason for PPL improvement: The best perplexity (~8.31) happens when pre-scoring is off (top-k = 0, sample_size = 0) and min_seq_len ≥ n_query is set. The paper itself says this gain comes from that configuration (forcing the faster block/tiled path), not from pre-scoring. A clean ablation is needed to separate the effects.

- Unclear speedup claims:

> Compared to the original HyperAttention, these methods can generate a mild acceleration, with performance becoming more remarkable starting a

Code & Models

Repositories

bruce-jiyuefeng/prescored-transformer
pytorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Big Data and Digital Economy

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings