Threshold Differential Attention for Sink-Free, Ultra-Sparse, and Non-Dispersive Language Modeling
Xingyue Huang, Xueying Ding, Mingxuan Ju, Yozen Liu, Neil Shah, Tong Zhao

TL;DR
This paper introduces Threshold Differential Attention (TDA), a sink-free, ultra-sparse attention mechanism that improves long-context language modeling by controlling spurious attention and eliminating sinks without high computational costs.
Contribution
The paper proposes TDA, a novel attention method that achieves ultra-sparsity, sink elimination, and robustness in long-context language models, with theoretical guarantees and competitive empirical results.
Findings
TDA produces over 99% exact zeros in attention weights.
TDA eliminates attention sinks while maintaining performance.
Theoretically, TDA controls spurious survivors to O(1) per row.
Abstract
Softmax attention struggles with long contexts due to structural limitations: the strict sum-to-one constraint forces attention sinks on irrelevant tokens, and probability mass disperses as sequence lengths increase. We tackle these problems with Threshold Differential Attention (TDA), a sink-free attention mechanism that achieves ultra-sparsity and improved robustness at longer sequence lengths without the computational overhead of projection methods or the performance degradation caused by noise accumulation of standard rectified attention. TDA applies row-wise extreme-value thresholding with a length-dependent gate, retaining only exceedances. Inspired by the differential transformer, TDA also subtracts an inhibitory view to enhance expressivity. Theoretically, we prove that TDA controls the expected number of spurious survivors per row to O(1) and that consensus spurious matches…
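The mechanism described in the abstract can be illustrated with a small sketch for a single attention row. This is not the authors' implementation: the gate form `beta * sqrt(2 * log(n))` (chosen here because it tracks the typical maximum of n standard Gaussian noise scores), the parameters `beta` and `lam`, and the function names `threshold_row` and `tda_row` are all assumptions made for illustration. The sketch only aims to show the three ideas named in the text: a length-dependent threshold, keeping exceedances only, and subtracting an inhibitory view.

```python
import math
import random


def threshold_row(scores, beta=1.0):
    """Keep only exceedances over a length-dependent gate (assumed form).

    Scores at or below the gate become exact zeros, and the row is not
    renormalized, so nothing forces mass onto an irrelevant token.
    """
    n = len(scores)
    # Assumed gate: grows like the expected maximum of n Gaussian noise scores.
    tau = beta * math.sqrt(2.0 * math.log(n))
    return [max(s - tau, 0.0) for s in scores]


def tda_row(scores_pos, scores_neg, lam=0.5, beta=1.0):
    """Excitatory exceedances minus lam times inhibitory exceedances.

    `lam` is a hypothetical mixing weight, not a value from the paper.
    """
    excite = threshold_row(scores_pos, beta)
    inhibit = threshold_row(scores_neg, beta)
    return [e - lam * i for e, i in zip(excite, inhibit)]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4096
    # Mostly noise, with one planted relevant key at index 7.
    pos = [rng.gauss(0.0, 1.0) for _ in range(n)]
    neg = [rng.gauss(0.0, 1.0) for _ in range(n)]
    pos[7] = 8.0
    row = tda_row(pos, neg)
    zeros = sum(1 for w in row if w == 0.0)
    print("fraction of exact zeros:", zeros / n)
    print("weight on planted key:", row[7])
```

Under these assumptions, a row containing only noise can come out entirely zero, which is the sense in which a thresholded, unnormalized row has no need for a sink token. The planted key survives because its score clears the gate by a wide margin.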
