Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Yuval Ran-Milo

arXiv:2603.11487·cs.LG·April 20, 2026

TL;DR

This paper proves that, in some settings, attention sinks are functionally necessary in softmax Transformers because attention weights are normalized to sum to one. It also shows empirically that models solving trigger-conditional tasks develop sinks under softmax attention but not under ReLU attention.

Contribution

The paper formally shows that the normalization in softmax attention can force a sink to form, and provides experiments contrasting softmax attention with ReLU attention on the same tasks.

Findings

01

Softmax attention models develop strong sinks in trigger-conditional tasks.

02

ReLU attention can solve the same tasks without inducing sinks.

03

The softmax normalization constraint is the fundamental driver of sink behavior in these tasks.
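
The contrast between the softmax and ReLU findings can be illustrated with a small NumPy sketch. This is not the paper's construction: the single-head setup, the zero-valued sink token at position 0, the hand-picked scores, and the unnormalized form of ReLU attention are all assumptions made for illustration. It shows that softmax weights always sum to one, so a near-zero output requires parking the mass on a token whose value is zero, while ReLU weights can simply all be zero.

```python
import numpy as np

def softmax_attention(scores, values):
    """Softmax attention for one query: weights lie on the probability simplex."""
    w = np.exp(scores - scores.max())  # subtract max for numerical stability
    w = w / w.sum()
    return w, w @ values

def relu_attention(scores, values):
    """Unnormalized ReLU attention (a simplification): weights may all be zero."""
    w = np.maximum(scores, 0.0)
    return w, w @ values

# Hypothetical setup: position 0 is a sink token (e.g. a BOS token) whose
# value vector is zero; positions 1 and 2 carry nonzero content.
values = np.array([[0.0, 0.0],
                   [1.0, 2.0],
                   [3.0, 1.0]])

# "No trigger" case under softmax: the output can only approach zero by
# placing almost all attention mass on the zero-valued sink.
w_soft, out_soft = softmax_attention(np.array([20.0, 0.0, 0.0]), values)

# Same case under ReLU: negative scores give zero weights and an exact zero
# output, with no sink required.
w_relu, out_relu = relu_attention(np.array([-1.0, -1.0, -1.0]), values)

print("softmax weights:", w_soft, "output:", out_soft)
print("relu weights:   ", w_relu, "output:", out_relu)
```

In this toy example the softmax head reaches its default state only by concentrating on position 0 regardless of content, which is the sink pattern described in the findings; the ReLU head has no such constraint.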

Abstract

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax Transformers? We prove that, in some settings, it is the latter: computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025;…
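
As a reading aid, the trigger-conditional task described in the abstract can be written as a target function. This is a minimal sketch under stated assumptions, not the paper's formal definition: the function name `trigger_conditional_target`, the boolean `is_trigger` mask, and the convention that a trigger at the first position (with no preceding tokens) maps to zero are all hypothetical choices.

```python
import numpy as np

def trigger_conditional_target(X, is_trigger):
    """Target output for each position of a sequence.

    X:          (T, d) array of token representations.
    is_trigger: (T,) boolean mask marking trigger tokens (assumed encoding).

    At a trigger position t, the target is the mean of X[0..t-1];
    at every other position the target is the zero vector.
    """
    X = np.asarray(X, dtype=float)
    is_trigger = np.asarray(is_trigger, dtype=bool)
    out = np.zeros_like(X)
    # Position 0 has no preceding tokens, so it keeps the zero default
    # (an assumption; the source does not specify this edge case).
    for t in range(1, X.shape[0]):
        if is_trigger[t]:
            out[t] = X[:t].mean(axis=0)
    return out

X = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [5.0, 4.0]])
target = trigger_conditional_target(X, [False, False, True])
print(target)
```

Here only the last position is a trigger, so its target is the average of the first two rows and the other positions stay at zero.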

Code & Models

Repositories

yuvmilo/sinks-are-provably-necessary
github
