TL;DR
This paper proves that, in some settings, softmax Transformers must develop attention sinks because of the softmax normalization constraint. It demonstrates sinks in models that solve trigger-conditional tasks and shows that ReLU attention solves the same tasks without them.
Contribution
It formally shows that normalization in softmax attention can induce sinks, and provides experimental evidence contrasting softmax and ReLU attention behaviors.
Findings
Softmax attention models develop strong sinks in trigger-conditional tasks.
ReLU attention can solve the same tasks without inducing sinks.
In these settings, the softmax normalization constraint is the driver of sink behavior.
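The contrast in these findings can be illustrated with a toy numerical sketch. This is not the paper's construction: the function names, scores, and value vectors below are hypothetical, and position 0 is assumed to be a sink carrying a zero value vector. The point is only that softmax weights always sum to 1, so a near-zero output requires routing the mass to such a position, whereas unnormalized ReLU weights can all be zero.

```python
import numpy as np

def softmax_weights(scores):
    # Numerically stable softmax: weights are non-negative and sum to 1.
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def relu_weights(scores):
    # Unnormalized ReLU attention (assumed form): weights may all be zero.
    return np.maximum(scores, 0.0)

# Hypothetical value vectors; position 0 plays the role of the sink (zero value).
values = np.array([[0.0, 0.0],
                   [1.0, 2.0],
                   [3.0, 4.0]])

# Hypothetical "no trigger" scores: content positions are strongly suppressed.
scores_off = np.array([0.0, -20.0, -20.0])

w_soft = softmax_weights(scores_off)   # mass collapses onto position 0
w_relu = relu_weights(scores_off)      # every weight is exactly zero

out_soft = w_soft @ values             # close to zero only because the sink's value is zero
out_relu = w_relu @ values             # exactly zero, no sink needed

print("softmax weights:", w_soft, "sum =", w_soft.sum())
print("relu weights:   ", w_relu, "sum =", w_relu.sum())
```

In this sketch the softmax head reaches its default state only by concentrating attention on position 0, while the ReLU head reaches it by attending to nothing.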
Abstract
Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax Transformers? We prove that, in some settings, it is the latter: computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025;…
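The trigger-conditional task described in the abstract can be sketched as a reference target function. This is a minimal illustration under stated assumptions, not the paper's formal definition: the name `trigger_conditional_target` is hypothetical, "preceding" is assumed to exclude the trigger position itself, and a trigger at position 0 (with nothing before it) is assumed to yield zero.

```python
import numpy as np

def trigger_conditional_target(x, is_trigger):
    """Hypothetical reference target for the task.

    x:          (T, d) array of token representations
    is_trigger: (T,) boolean array marking trigger positions
    """
    T, d = x.shape
    out = np.zeros((T, d))
    for t in range(T):
        # Assumption: average strictly preceding tokens; zero if none exist.
        if is_trigger[t] and t > 0:
            out[t] = x[:t].mean(axis=0)
        # Non-trigger positions keep the default output of zero.
    return out

x = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [5.0, 4.0],
              [7.0, 6.0]])
is_trigger = np.array([False, False, True, False])

y = trigger_conditional_target(x, is_trigger)
print(y)
```

Here only position 2 is a trigger, so its target is the mean of the first two rows and every other position maps to zero.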
