A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance
Georgios Tzachristas, Lei Deng, Ioannis Tzachristas, Gong Zhang, Renhai Chen

TL;DR
This paper introduces a mathematical framework for certified Top-$k$ sparse attention, providing bounds on the approximation error and practical guidelines for reducing the number of scored keys while maintaining accuracy.
Contribution
It develops a unified theory linking total variation distance to attention truncation error, with explicit bounds and an asymptotic rule for minimal $k$ under Gaussian models.
Findings
Certified Top-$k$ attention reduces the number of scored keys by a factor of 2 to 4.
Theoretical bounds accurately predict the scaling of $k_\varepsilon/n$.
Experiments validate the framework on BERT and synthetic data.
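To make the quantity $k_\varepsilon/n$ concrete, here is a minimal sketch that finds the smallest $k$ whose discarded softmax tail mass stays within a budget $\varepsilon$. It assumes synthetic Gaussian logits; the function names `softmax` and `minimal_k`, the sequence length, the logit scale, and the budget value are all illustrative choices, not taken from the paper. It sorts the full score vector, so it illustrates the definition rather than the paper's certification procedure.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def minimal_k(logits, eps):
    """Smallest k such that the softmax mass outside the top-k is <= eps."""
    probs = sorted(softmax(logits), reverse=True)
    head = 0.0
    for k, p in enumerate(probs, start=1):
        head += p
        if 1.0 - head <= eps:
            return k
    return len(probs)  # fallback if rounding prevents reaching the budget

# Assumed setup: i.i.d. Gaussian logits (illustrative parameters only).
rng = random.Random(0)
n = 1024
logits = [rng.gauss(0.0, 2.0) for _ in range(n)]
k_eps = minimal_k(logits, eps=0.05)
print(k_eps, k_eps / n)
```

Varying `n` and the logit scale in this sketch is one way to see empirically how the ratio $k_\varepsilon/n$ behaves under a Gaussian score model.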
Abstract
We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat{P}$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat{P}) = 1 - e^{-\mathrm{KL}(\hat{P}\,\|\,P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds -- from a single boundary gap through multi-gap and blockwise variants -- that control $\mathrm{TV}(P,\hat{P})$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q,K,V) - \mathrm{Attn}_k(q,K,V)\|_2 = \tau\,\|\mu_{\mathrm{tail}} - \mu_{\mathrm{head}}\|_2$ with $\tau = \mathrm{TV}(P,\hat{P})$,…
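The identities described in the abstract can be checked numerically. The sketch below assumes the Top-$k$ truncation is the softmax distribution restricted to the $k$ largest entries and renormalized, and writes `tau` for the discarded tail mass; all function and variable names are illustrative, and the random logits and values are synthetic. It compares the tail mass with the total-variation distance and with one minus the exponential of the negative KL divergence from the truncated to the full distribution, then compares the output error with `tau` times the distance between the tail and head value means.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def norm2(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def topk_report(logits, values, k):
    """Evaluate both sides of each identity; requires 0 < k < len(logits)."""
    n, d = len(logits), len(values[0])
    p = softmax(logits)
    order = sorted(range(n), key=lambda i: p[i], reverse=True)
    head, tail = order[:k], order[k:]
    tau = sum(p[i] for i in tail)            # discarded softmax tail mass
    p_hat = [0.0] * n
    for i in head:
        p_hat[i] = p[i] / (1.0 - tau)        # renormalized top-k distribution
    tv = 0.5 * sum(abs(a - b) for a, b in zip(p, p_hat))
    kl = sum(p_hat[i] * math.log(p_hat[i] / p[i]) for i in head)

    def weighted(weights, idx):
        return [sum(weights[i] * values[i][j] for i in idx) for j in range(d)]

    full = weighted(p, range(n))             # full attention output
    mu_head = weighted(p_hat, head)          # equals the truncated output
    mu_tail = [x / tau for x in weighted(p, tail)]
    err = norm2([a - b for a, b in zip(full, mu_head)])
    pred = tau * norm2([a - b for a, b in zip(mu_tail, mu_head)])
    return {"tau": tau, "tv": tv, "tv_from_kl": 1.0 - math.exp(-kl),
            "err": err, "pred": pred}

# Synthetic data (assumed for illustration only).
rng = random.Random(0)
n, d, k = 64, 8, 16
logits = [rng.gauss(0.0, 1.0) for _ in range(n)]
values = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
report = topk_report(logits, values, k)
print(report)
```

Up to floating-point rounding, `tau`, `tv`, and `tv_from_kl` should agree, as should `err` and `pred`.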
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Adversarial Robustness in Machine Learning
