Sinkhorn doubly stochastic attention rank decay analysis

Michela Lapenna; Rita Fioresi; Bahman Gharesifard

arXiv:2604.07925·cs.LG·April 10, 2026

TL;DR

This paper analyzes how Sinkhorn-normalized doubly stochastic attention preserves rank better across layers in Transformers, reducing signal degradation and improving performance in NLP and vision tasks.

Contribution

It demonstrates that Sinkhorn normalization maintains higher rank in attention matrices than Softmax, with theoretical and empirical validation across tasks.

Findings

01

Sinkhorn attention preserves rank more effectively than Softmax.

02

Rank decay to one is doubly exponential with depth.

03

Skip connections are essential to mitigate rank collapse.
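As a rough illustration of what "doubly exponential" means in finding 02, consider a generic recursion rather than the paper's actual bound. The symbols below are assumptions introduced for this sketch: r_L is some measure of how far the layer-L token matrix is from rank one, c > 0 is a constant, and p > 1 is an exponent.

```latex
% Hypothetical per-layer contraction (illustrative, not the paper's statement):
%   r_{L+1} \le c \, r_L^{\,p}, \qquad c > 0,\ p > 1.
% Unrolling the recursion over L layers gives
\[
  r_L \;\le\; c^{\frac{p^L - 1}{p - 1}} \, r_0^{\,p^L}
      \;=\; c^{-\frac{1}{p-1}} \left( c^{\frac{1}{p-1}} \, r_0 \right)^{p^L}.
\]
% If c^{1/(p-1)} r_0 < 1, the right-hand side shrinks like a constant
% raised to the power p^L, i.e. doubly exponentially in the depth L.
```

Under these assumptions the bound only forces decay when the initial residual is small enough that c^{1/(p-1)} r_0 < 1; the constants and exponent that apply to Sinkhorn attention are given in the paper itself.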

Abstract

The self-attention mechanism is central to the success of Transformer architectures. However, standard row-stochastic attention has been shown to suffer from significant signal degradation across layers. In particular, it can induce rank collapse, resulting in increasingly uniform token representations, as well as entropy collapse, characterized by highly concentrated attention distributions. Recent work has highlighted the benefits of doubly stochastic attention as a form of entropy regularization, promoting a more balanced attention distribution and leading to improved empirical performance. In this paper, we study rank collapse across network depth and show that doubly stochastic attention matrices normalized with the Sinkhorn algorithm preserve rank more effectively than standard Softmax row-stochastic ones. As previously shown for Softmax, skip connections are crucial to mitigate rank…
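The contrast the abstract draws between row-stochastic and doubly stochastic attention can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' experimental setup: the helper names (`softmax_rows`, `sinkhorn`, `rank_one_residual`, `run`), the toy sizes, the shared random weights across layers, the absence of skip connections, and the residual used as a proxy for "distance from rank one" are all choices made for this sketch.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(S):
    """Row-stochastic normalization: every row sums to 1."""
    out = []
    for row in S:
        m = max(row)  # subtract the row max for numerical stability
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        out.append([v / z for v in e])
    return out

def sinkhorn(S, n_iters=200):
    """Approximately doubly stochastic normalization of exp(S):
    alternate row and column normalization (Sinkhorn iterations)."""
    m = max(max(row) for row in S)
    K = [[math.exp(s - m) for s in row] for row in S]
    for _ in range(n_iters):
        row_sums = [sum(row) for row in K]
        K = [[v / z for v in row] for row, z in zip(K, row_sums)]
        col_sums = [sum(col) for col in zip(*K)]
        K = [[v / c for v, c in zip(row, col_sums)] for row in K]
    return K

def rank_one_residual(X):
    """Relative Frobenius distance of X from the matrix whose rows all equal
    the mean row; 0 means all token representations are identical."""
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    num = sum((v - mu) ** 2 for row in X for v, mu in zip(row, mean))
    den = sum(v * v for row in X for v in row)
    return math.sqrt(num / den)

def run(normalize, X, Wq, Wk, depth):
    """Stack `depth` pure attention layers (no skip connections, no values/MLP),
    reusing the same weights in every layer (a simplification)."""
    d = len(Wq[0])
    history = [rank_one_residual(X)]
    for _ in range(depth):
        Q, K = matmul(X, Wq), matmul(X, Wk)
        S = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d) for kr in K] for qr in Q]
        X = matmul(normalize(S), X)
        history.append(rank_one_residual(X))
    return history

random.seed(0)
n, d, depth = 6, 4, 6  # tokens, embedding width, layers (toy sizes)
X0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
Wq = [[random.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]
Wk = [[random.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]

softmax_hist = run(softmax_rows, X0, Wq, Wk, depth)
sinkhorn_hist = run(sinkhorn, X0, Wq, Wk, depth)
for layer, (a, b) in enumerate(zip(softmax_hist, sinkhorn_hist)):
    print(f"layer {layer}: softmax residual {a:.3e}, sinkhorn residual {b:.3e}")
```

Printing the two residual histories side by side gives a quick way to see how fast each normalization drives the tokens toward a common representation in this toy setting. The numbers depend on the seed, the weight scale, and the depth, so they should be read as an illustration rather than a reproduction of the paper's results.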
