Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

TL;DR
Top-Theta Attention introduces a training-free, threshold-based sparsification method for transformer attention, significantly reducing computational load during inference with minimal accuracy loss across NLP tasks.
Contribution
It proposes a novel, calibration-based thresholding technique for sparsifying attention without retraining, improving efficiency and robustness across data domains.
Findings
Achieves 3-10x reduction in V-cache usage.
Up to 10x fewer attention elements during inference.
Degrades accuracy by no more than 1%.
Abstract
We present Top-Theta (Top-θ) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-θ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
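The calibration idea in the abstract can be illustrated with a minimal NumPy sketch for a single head. Everything specific here is an assumption made for illustration, not the authors' procedure: the function names `calibrate_threshold` and `threshold_attention`, the rule "average the k-th largest score over calibration rows", thresholding the pre-softmax scores, and always keeping each row's maximum so no row ends up empty. The compensation techniques mentioned in the abstract are not modeled.

```python
import numpy as np

def calibrate_threshold(score_rows, k):
    """Return one static threshold: the mean k-th largest score over calibration rows.

    Assumption: averaging is just one plausible calibration rule.
    """
    kth = [np.sort(row)[-k] for row in score_rows if row.size >= k]
    return float(np.mean(kth))

def threshold_attention(scores, theta):
    """Softmax over only the scores that reach theta; all others get zero weight."""
    keep = scores >= theta
    # Assumption: always keep each row's maximum so that no row is left empty.
    keep[np.arange(scores.shape[0]), scores.argmax(axis=-1)] = True
    masked = np.where(keep, scores, -np.inf)
    masked -= masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True), keep

# Synthetic stand-in for attention-score rows collected during calibration.
rng = np.random.default_rng(0)
calib = [rng.normal(size=64) for _ in range(200)]
theta = calibrate_threshold(calib, k=8)

# At inference the same theta is reused; no per-row search is needed.
scores = rng.normal(size=(4, 64))
weights, keep = threshold_attention(scores, theta)
```

With a threshold calibrated this way, the number of kept elements per row is close to `k` on average rather than exactly `k`, which is the trade-off that distinguishes thresholding from top-k selection.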
Peer Reviews
Decision: Submitted to ICLR 2026
1. **Training-free & model-centric.** Calibrate once per model, a few hundred samples, then reuse across domains (ARC-C → HumanEval/LongBench); this is much cheaper than retrain/fine-tune-based sparsity.
2. **Tile- and kernel-friendly formulation.** Pure elementwise thresholding; no per-row top-k that breaks tiling. This is exactly where existing top-k attention hurts.
3. **Strong empirical coverage.** LLaMA-2, LLaMA-3, LLaMA-3.1; 7B→70B; prefill (QA) and decoding (HumanEval, LongBench); GQA ca
1. **No wall-clock / kernel-level evaluation.** The main selling point is “better for tiled / distributed kernels than top-k,” but the paper does not show an actual implementation or runtime vs. FlashAttention+Top-k baselines on GPU.
2. **Calibration cost vs. model/library changes not fully discussed.** If KV layout or rotary settings change (common in serving stacks), do we need to recalibrate?
+ The proposed method does not require training.
+ The proposed method shows effectiveness on the state-of-the-art open-source LLM models.
- Limited novelty. Sparse attention has been extensively studied, and many state-of-the-art methods already exist. The proposed Top-Theta or threshold-based sparsification appears to be a minor variation of the well-known top-k attention, which has been applied both in standard transformer architectures and in large language models. The contribution, therefore, seems incremental rather than fundamentally new.
- Unclear motivation. The paper does not clearly explain why a content-based attention
* **High Efficiency**: Reduces **V-cache usage by 3–10×** and attention elements by **up to 10×** with <1% accuracy loss.
* **Robust to Domain Shift**: Thresholds are **model-intrinsic**, calibrated once and work across tasks and datasets.
* **Better Than Top-\(k\)**: Replaces expensive top-\(k\) search with **constant-time thresholding**, removing row-wise dependencies.
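The point about row-wise dependencies can be made concrete with a small NumPy sketch. The helper names `threshold_mask_tiled` and `topk_mask`, the threshold value 1.0, and the tile width 16 are hypothetical choices for illustration; this is not a GPU kernel and says nothing about actual runtime.

```python
import numpy as np

def threshold_mask_tiled(scores, theta, tile):
    """Build the keep-mask one column tile at a time; each tile needs only theta."""
    parts = [scores[:, s:s + tile] >= theta for s in range(0, scores.shape[1], tile)]
    return np.concatenate(parts, axis=1)

def topk_mask(scores, k):
    """Build the keep-mask for per-row top-k; the k-th value requires the whole row."""
    kth = np.partition(scores, -k, axis=1)[:, -k][:, None]
    return scores >= kth

rng = np.random.default_rng(1)
scores = rng.normal(size=(4, 64))

full = scores >= 1.0                                   # mask from the whole row
tiled = threshold_mask_tiled(scores, theta=1.0, tile=16)  # mask assembled per tile
same = bool((full == tiled).all())                     # tiles agree with the full row
```

Because each comparison against a fixed threshold is independent, the tiled mask matches the full-row mask by construction, whereas `topk_mask` cannot decide any element until the entire row has been seen.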
* This paper lacks a comprehensive review of the field of sparse attention, which has already been popular to study for a long time.
* This paper lacks a comprehensive experimental comparison with current popular methods; it is not clear how this method outperforms other sparsity-based attention approximation algorithms.
* The authors present a new method with promising performance in this work, but after checking the whole paper, it is still not clear what this paper wants to solve. If they jus
Taxonomy
Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
Methods: Softmax · Attention Is All You Need · Pruning
