A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance

Georgios Tzachristas; Lei Deng; Ioannis Tzachristas; Gong Zhang; Renhai Chen

arXiv:2512.07647·cs.LG·December 9, 2025

Open Access

TL;DR

This paper introduces a mathematical framework for certified Top-$k$ sparse attention, with bounds on the approximation error and practical guidelines for reducing the number of keys scored while maintaining accuracy.

Contribution

It develops a unified theory linking total variation distance to attention truncation error, with explicit bounds and an asymptotic rule for minimal $k$ under Gaussian models.

Findings

01

Certified Top-$k$ attention reduces the number of scored keys by a factor of 2-4.

02

Theoretical bounds accurately predict the scaling of $k_{\varepsilon}/n$.

03

Experiments validate the framework on BERT and synthetic data.
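
The ratio in finding 02 can be explored numerically. The sketch below assumes that $k_{\varepsilon}$ denotes the smallest $k$ whose Top-$k$ truncation keeps the discarded softmax tail mass at or below a budget $\varepsilon$; that reading, the helper name `min_k_for_tv`, and the values of `sigma`, `eps`, and `n` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def min_k_for_tv(scores, eps):
    """Smallest k whose discarded softmax tail mass is at most eps (assumed definition)."""
    p = np.exp(scores - scores.max())   # numerically stable softmax
    p /= p.sum()
    head_mass = np.cumsum(np.sort(p)[::-1])   # mass kept by the top 1, 2, ... keys
    # First position where the kept mass reaches 1 - eps; +1 converts index to count.
    k = int(np.searchsorted(head_mass, 1.0 - eps, side="left")) + 1
    return min(k, len(p))               # guard against floating-point shortfall

# Hypothetical experiment: i.i.d. Gaussian logits with standard deviation sigma.
rng = np.random.default_rng(0)
sigma, eps = 2.0, 0.05
for n in (256, 1024, 4096):
    scores = sigma * rng.normal(size=n)
    k_eps = min_k_for_tv(scores, eps)
    print(f"n={n:5d}  k_eps={k_eps:5d}  k_eps/n={k_eps / n:.3f}")
```

This only measures the empirical ratio on synthetic logits; it does not implement the paper's asymptotic rule.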

Abstract

We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat{P}$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P, \hat{P}) = 1 - e^{-\mathrm{KL}(\hat{P} \,\|\, P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds, from a single boundary gap through multi-gap and blockwise variants, that control $\mathrm{TV}(P, \hat{P})$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q, K, V) - \mathrm{Attn}_{k}(q, K, V)\|_{2} = \tau \|\mu_{\mathrm{tail}} - \mu_{\mathrm{head}}\|_{2}$ with $\tau = \mathrm{TV}(P, \hat{P})$,…
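
The identities stated in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes $\hat{P}$ is $P$ restricted to the $k$ largest logits and renormalized, and that $\mu_{\mathrm{head}}$ and $\mu_{\mathrm{tail}}$ are the $P$-weighted means of the value rows over the kept and discarded keys. The function name `topk_check` and the sizes `n`, `d`, `k` are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def topk_check(scores, V, k):
    """Compare both sides of the TV and output-error identities for one query (requires k < n)."""
    P = softmax(scores)
    mask = np.zeros(P.shape, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True        # keep the k largest logits (the head)
    P_hat = np.where(mask, P, 0.0)
    P_hat = P_hat / P_hat.sum()                 # renormalized Top-k distribution

    tv = 0.5 * np.abs(P - P_hat).sum()          # total-variation distance
    tail_mass = P[~mask].sum()                  # discarded softmax mass
    kl = np.sum(P_hat[mask] * np.log(P_hat[mask] / P[mask]))   # KL(P_hat || P)
    tv_from_kl = 1.0 - np.exp(-kl)

    mu_head = P[mask] @ V[mask] / P[mask].sum()     # P-weighted mean of kept values
    mu_tail = P[~mask] @ V[~mask] / tail_mass       # P-weighted mean of dropped values
    out_err = np.linalg.norm(P @ V - P_hat @ V)     # ||Attn - Attn_k||_2
    factorized = tail_mass * np.linalg.norm(mu_tail - mu_head)
    return tv, tail_mass, tv_from_kl, out_err, factorized

rng = np.random.default_rng(0)
n, d, k = 64, 8, 16                             # illustrative sizes
scores = rng.normal(size=n)
V = rng.normal(size=(n, d))
tv, tail_mass, tv_from_kl, out_err, factorized = topk_check(scores, V, k)
print(tv, tail_mass, tv_from_kl)                # three matching values
print(out_err, factorized)                      # two matching values
```

The agreement follows from writing full attention as $(1 - \tau)\mu_{\mathrm{head}} + \tau\mu_{\mathrm{tail}}$ and the truncated output as $\mu_{\mathrm{head}}$, so their difference is $\tau(\mu_{\mathrm{tail}} - \mu_{\mathrm{head}})$.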

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Adversarial Robustness in Machine Learning