Threshold Differential Attention for Sink-Free, Ultra-Sparse, and Non-Dispersive Language Modeling

Xingyue Huang; Xueying Ding; Mingxuan Ju; Yozen Liu; Neil Shah; Tong Zhao

arXiv:2601.12145 · cs.LG · April 17, 2026

TL;DR

This paper introduces Threshold Differential Attention (TDA), a sink-free, ultra-sparse attention mechanism that improves long-context language modeling by suppressing spurious attention matches and eliminating attention sinks, without the computational overhead of projection-based methods.

Contribution

The paper proposes TDA, a novel attention method that achieves ultra-sparsity, sink elimination, and robustness in long-context language models, with theoretical guarantees and competitive empirical results.

Findings

01

TDA produces over 99% exact zeros in attention weights.

02

TDA eliminates attention sinks while maintaining performance.

03

Theoretically, TDA bounds the expected number of spurious survivors per attention row by $O(1)$.
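
To give some intuition for why a length-dependent threshold can keep spurious survivors bounded, the following is a rough illustration and not the paper's proof. It assumes, purely for illustration, that the $n$ irrelevant scores $s_1, \dots, s_n$ in a row are i.i.d. standard Gaussian and that the threshold is $\tau_n = \sqrt{2 \log n}$; neither choice is stated in the listing above. Under those assumptions, linearity of expectation and the Gaussian tail bound $\Pr(Z > t) \le e^{-t^2/2}$ give:

```latex
\mathbb{E}\bigl[\#\{\, j : s_j > \tau_n \,\}\bigr]
  = n \,\Pr(Z > \tau_n)
  \le n \, e^{-\tau_n^{2}/2}
  = n \, e^{-\log n}
  = 1,
\qquad \tau_n = \sqrt{2 \log n}.
```

So the expected count of noise scores that exceed the threshold stays at most 1 no matter how long the row is, which is the flavor of an $O(1)$ guarantee. A fixed threshold would instead let the count grow linearly in $n$.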

Abstract

Softmax attention struggles with long contexts due to structural limitations: the strict sum-to-one constraint forces attention sinks on irrelevant tokens, and probability mass disperses as sequence lengths increase. We tackle these problems with Threshold Differential Attention (TDA), a sink-free attention mechanism that achieves ultra-sparsity and improved robustness at longer sequence lengths without the computational overhead of projection methods or the performance degradation caused by noise accumulation of standard rectified attention. TDA applies row-wise extreme-value thresholding with a length-dependent gate, retaining only exceedances. Inspired by the differential transformer, TDA also subtracts an inhibitory view to enhance expressivity. Theoretically, we prove that TDA controls the expected number of spurious survivors per row to $O(1)$ and that consensus spurious matches…
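
The mechanism described in the abstract (threshold each row with a length-dependent gate, keep only the exceedances, then subtract an inhibitory view) might be sketched for a single attention row as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `length_gate` and `tda_row`, the gate form `kappa * sqrt(2 * log n)`, the mixing weight `lam`, and the absence of any final rescaling are all hypothetical choices made for illustration.

```python
import math


def length_gate(n, kappa=1.0):
    """Hypothetical length-dependent threshold.

    ASSUMPTION: the gate grows like sqrt(2 log n), the typical scale of the
    maximum of n standard Gaussian scores. The paper's exact gate is not
    given in the listing above.
    """
    return kappa * math.sqrt(2.0 * math.log(max(n, 2)))


def tda_row(scores_excit, scores_inhib, lam=0.5, kappa=1.0):
    """Sketch of one thresholded differential attention row.

    scores_excit, scores_inhib: equal-length lists of query-key scores from
    two views (excitatory and inhibitory). `lam` is a hypothetical weight on
    the inhibitory view.
    """
    if len(scores_excit) != len(scores_inhib):
        raise ValueError("both views must score the same keys")
    tau = length_gate(len(scores_excit), kappa)
    # Keep only exceedances over the gate; everything else is an exact zero.
    excit = [max(s - tau, 0.0) for s in scores_excit]
    inhib = [max(s - tau, 0.0) for s in scores_inhib]
    # Subtract the inhibitory view. The row is NOT renormalized to sum to
    # one, so a row with no exceedances is simply all zeros.
    return [e - lam * i for e, i in zip(excit, inhib)]


# Usage: one strong match among weak scores, and a row of pure noise.
row = tda_row([0.1, 5.0, -0.3, 0.2], [0.0, 0.0, 0.0, 0.0])
noise_row = tda_row([0.1, 0.2, -0.1], [0.3, -0.2, 0.0])
print(row)
print(noise_row)
```

In this sketch, three of the four entries of `row` are exact zeros and `noise_row` is entirely zero. Because nothing forces a row to sum to one, a query with no relevant key has no need to park weight on an irrelevant token, which is the sense in which such a scheme can avoid sinks.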
