Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

Peter Súkeník; Cristina López Amado; Christoph H. Lampert; Marco Mondelli

arXiv:2605.08453·cs.LG·May 12, 2026

TL;DR

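This paper studies sinks and diagonal patterns as mechanisms for attention switch and oversmoothing prevention in transformers. It combines a geometric analysis of when sinks can be represented, conditions under which dense attention provably smooths more than sparse attention (with empirical verification), and a quantitative comparison of the costs of representing sinks vs. diagonal patterns.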

Contribution

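The paper gives a geometric analysis of when sinks can be represented, refines the understanding of their role in oversmoothing prevention, proves an equivalence between sinks and a hard attention switch, and compares the costs of representing sinks and diagonal patterns to explain why sinks are favored in pretrained transformers.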

Findings

01

Representing a sink requires a geometric condition: the embedding of the sink must be aligned with all other embeddings.

02

Under conditions specified in the paper, dense attention provably smooths more than sparse attention, and these conditions are empirically found to hold often in practice.

03

Sinks are favored over diagonal patterns in pretrained transformers because they are less costly to represent.

Abstract

This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained…
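
The three attention patterns named in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper: the token values, the logit gap `BIG`, the choice of a zero value vector for the sink token, and the `spread` measure (mean squared distance from the centroid, used as a rough proxy for smoothing) are all assumptions made for illustration. It builds a sink pattern, a diagonal pattern, and a dense (uniform) pattern over four tokens and compares the resulting attention outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(weights, values):
    """Attention output for one query: weighted average of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def spread(vectors):
    """Mean squared distance from the centroid (illustrative smoothing proxy)."""
    n, dim = len(vectors), len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return sum(
        sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vectors
    ) / n

# Toy value vectors. Token 0 plays the sink and is ASSUMED to carry a zero value.
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
n = len(values)
BIG = 20.0  # hypothetical logit gap that makes the softmax nearly one-hot

# Sink pattern: every query puts a large logit on token 0.
sink_rows = [softmax([BIG if j == 0 else 0.0 for j in range(n)]) for _ in range(n)]
sink_out = [attend(row, values) for row in sink_rows]

# Diagonal pattern: every query puts a large logit on its own position.
diag_rows = [softmax([BIG if j == i else 0.0 for j in range(n)]) for i in range(n)]
diag_out = [attend(row, values) for row in diag_rows]

# Dense pattern: uniform attention over all tokens.
dense_rows = [softmax([0.0] * n) for _ in range(n)]
dense_out = [attend(row, values) for row in dense_rows]

print("largest |entry| of sink output:", max(abs(x) for row in sink_out for x in row))
print("spread of values:        ", spread(values))
print("spread of diagonal output:", spread(diag_out))
print("spread of dense output:   ", spread(dense_out))
```

In this toy setting the sink pattern drives the attention output close to zero for every token, which is the behavior the abstract calls a hard attention switch. The diagonal pattern returns approximately each token's own value (token self-communication), while uniform dense attention maps every token to the same average and so removes all spread. The paper's actual conditions and proofs are more general than this example.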
