Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Konstantin Berestizshevsky; Renzo Andri; Lukas Cavigelli

arXiv:2502.08363 · cs.CL · August 25, 2025



TL;DR

Top-Theta Attention introduces a training-free, threshold-based sparsification method for transformer attention, significantly reducing computational load during inference with minimal accuracy loss across NLP tasks.

Contribution

It proposes a novel, calibration-based thresholding technique for sparsifying attention without retraining, improving efficiency and robustness across data domains.

Findings

1. Achieves 3-10x reduction in V-cache usage.
2. Up to 10x fewer attention elements during inference.
3. Degrades accuracy by no more than 1%.

Abstract

We present Top-Theta (Top-$θ$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$θ$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
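The calibration idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `calibrate_threshold` and `threshold_mask`, the choice to average the k-th largest score over calibration rows, and the use of random data in place of real attention scores are all assumptions made for illustration.

```python
import numpy as np

def calibrate_threshold(score_rows, k):
    """Hypothetical calibration for one head: average, over calibration rows,
    of the k-th largest score in each row (each row must have >= k entries)."""
    kth_values = [np.partition(row, -k)[-k] for row in score_rows]
    return float(np.mean(kth_values))

def threshold_mask(scores, theta):
    """Elementwise test: keep an attention element iff its score reaches theta."""
    return scores >= theta

# Random stand-ins for attention scores (assumption: real scores would come
# from running the model on a small calibration set).
rng = np.random.default_rng(0)
calib = rng.standard_normal((256, 128))   # 256 calibration rows, 128 keys each
theta = calibrate_threshold(calib, k=16)  # static threshold for this head

test_scores = rng.standard_normal((64, 128))
mask = threshold_mask(test_scores, theta)
kept_per_row = mask.sum(axis=1)           # varies per row, close to k on average
print(theta, kept_per_row.mean())
```

Unlike top-k, the number of retained elements is not exactly k in every row; the static threshold only targets that count on average, which is presumably where the paper's compensation techniques come in.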

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. **Training-free & model-centric.** Calibrate once per model, a few hundred samples, then reuse across domains (ARC-C → HumanEval/LongBench); this is much cheaper than retrain/fine-tune-based sparsity.
2. **Tile- and kernel-friendly formulation.** Pure elementwise thresholding; no per-row top-k that breaks tiling. This is exactly where existing top-k attention hurts.
3. **Strong empirical coverage.** LLaMA-2, LLaMA-3, LLaMA-3.1; 7B→70B; prefill (QA) and decoding (HumanEval, LongBench); GQA ca

Weaknesses

1. **No wall-clock / kernel-level evaluation.** The main selling point is “better for tiled / distributed kernels than top-k,” but the paper does not show an actual implementation or runtime vs. FlashAttention+Top-k baselines on GPU.
2. **Calibration cost vs. model/library changes not fully discussed.** If KV layout or rotary settings change (common in serving stacks), do we need to recalibrate?

Reviewer 02 · Rating 2 · Confidence 5

Strengths

+ The proposed method does not require training.
+ The proposed method is effective on state-of-the-art open-source LLMs.

Weaknesses

- Limited novelty. Sparse attention has been extensively studied, and many state-of-the-art methods already exist. The proposed Top-Theta or threshold-based sparsification appears to be a minor variation of the well-known top-k attention, which has been applied both in standard transformer architectures and in large language models. The contribution, therefore, seems incremental rather than fundamentally new.
- Unclear motivation. The paper does not clearly explain why a content-based attention

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* **High Efficiency**: Reduces **V-cache usage by 3–10×** and attention elements by **up to 10×** with <1% accuracy loss.
* **Robust to Domain Shift**: Thresholds are **model-intrinsic**: calibrated once, they work across tasks and datasets.
* **Better Than Top-$k$**: Replaces expensive top-$k$ search with **constant-time thresholding**, removing row-wise dependencies.
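The point about row-wise dependencies can be made concrete with a small NumPy sketch. The helper names `topk_mask` and `tiled_threshold_mask`, the tile size, and the fixed threshold value are hypothetical and chosen only for illustration. Top-k has to inspect a whole row before it can decide on any element, whereas a fixed threshold can be applied to each tile independently and still gives the same mask as applying it to the full row.

```python
import numpy as np

def topk_mask(scores, k):
    """Per-row top-k: the cutoff for a row is known only after seeing the whole row."""
    kth = np.partition(scores, -k, axis=1)[:, -k][:, None]
    return scores >= kth

def tiled_threshold_mask(scores, theta, tile):
    """Threshold each column tile on its own; no information crosses tiles."""
    parts = [scores[:, s:s + tile] >= theta
             for s in range(0, scores.shape[1], tile)]
    return np.concatenate(parts, axis=1)

rng = np.random.default_rng(1)
scores = rng.standard_normal((4, 128))   # random stand-in for attention scores
theta = 1.0                              # assumed to be pre-calibrated

full = scores >= theta
tiled = tiled_threshold_mask(scores, theta, tile=32)
print(np.array_equal(full, tiled))       # tiling does not change the result
print(topk_mask(scores, 8).sum(axis=1))  # top-k keeps exactly k per row
```

Whether this property translates into faster kernels is a separate question; Reviewer 01 notes that the paper reports no wall-clock measurements.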

Weaknesses

* This paper lacks a comprehensive review of the field of sparse attention, which has been studied extensively for a long time.
* This paper lacks a comprehensive experimental comparison with current popular methods; it is not clear how this method outperforms other sparsity-based attention approximation algorithms.
* The authors present a new method with promising performance in this work, but after checking the whole paper, it is still not clear what this paper wants to solve. If they jus

Code & Models

Repositories

kostyanoob/top-theta-attention · PyTorch · Official


Taxonomy

Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications

Methods: Softmax · Attention Is All You Need · Pruning