Sinkhorn doubly stochastic attention rank decay analysis
Michela Lapenna, Rita Fioresi, Bahman Gharesifard

TL;DR
This paper analyzes how Sinkhorn-normalized doubly stochastic attention preserves rank across Transformer layers more effectively than standard Softmax attention, reducing signal degradation and improving performance on NLP and vision tasks.
Contribution
The paper shows that attention matrices normalized with Sinkhorn retain higher rank than Softmax-normalized ones, with theoretical and empirical validation across tasks.
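To make the difference between the two normalizations concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the 3x3 `scores` matrix, and the iteration count `n_iters=50` are illustrative assumptions. Softmax normalizes each row once, so only the rows sum to one. Sinkhorn alternates row and column normalization of the exponentiated scores until both rows and columns (approximately) sum to one.

```python
import math


def softmax_rows(scores):
    """Row-stochastic normalization: every row of the result sums to 1."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sinkhorn(scores, n_iters=50):
    """Alternate row and column normalization of exp(scores).

    n_iters is an illustrative choice, not a value taken from the paper.
    """
    n = len(scores)
    m = max(max(row) for row in scores)
    mat = [[math.exp(v - m) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row step: rescale each row to sum to 1.
        mat = [[v / sum(row) for v in row] for row in mat]
        # Column step: rescale each column to sum to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


def column_sums(mat):
    """Sum of each column of a list-of-lists matrix."""
    return [sum(row[j] for row in mat) for j in range(len(mat[0]))]


# Hypothetical 3x3 score matrix standing in for scaled query-key products.
scores = [[1.0, 0.5, -0.2], [0.1, 2.0, 0.3], [-0.5, 0.4, 1.5]]
soft = softmax_rows(scores)
sink = sinkhorn(scores)
```

Comparing `column_sums(soft)` with `column_sums(sink)` shows the structural difference: the Softmax columns are unconstrained, while the Sinkhorn columns sum to one, so every token receives the same total attention.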
Findings
Sinkhorn attention preserves rank more effectively than Softmax.
Rank decays to one doubly exponentially with depth.
Skip connections are essential to mitigate rank collapse.
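The role of skip connections can be illustrated with a deliberately simplified toy, sketched below in plain Python. It is not the paper's setting: the attention matrix `P` is fixed and hand-picked (symmetric, doubly stochastic), there are no value projections or MLP blocks, and the depth of 6 is arbitrary. Because the map is linear, it does not reproduce the doubly exponential rate. It only shows that repeated mixing pulls all token rows toward a common row (a rank-one matrix), and that adding the input back slows this down. The helper `relative_residual` is an assumed proxy for distance from rank one: the Frobenius norm of the tokens minus their mean row, divided by the norm of the tokens.

```python
import math


def matmul(a, b):
    """Product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Elementwise sum of two matrices of equal shape."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def relative_residual(x):
    """Norm of x minus its mean row, relative to the norm of x.

    A value of 0 means every token row is identical (rank at most one).
    """
    n, d = len(x), len(x[0])
    mean = [sum(x[i][j] for i in range(n)) / n for j in range(d)]
    res = math.sqrt(sum((x[i][j] - mean[j]) ** 2
                        for i in range(n) for j in range(d)))
    norm = math.sqrt(sum(v * v for row in x for v in row))
    return res / norm


def run(p, x, depth, skip):
    """Apply x <- p x (no skip) or x <- x + p x (skip) for `depth` layers."""
    for _ in range(depth):
        px = matmul(p, x)
        x = add(x, px) if skip else px
    return x


# Hypothetical fixed doubly stochastic matrix: rows and columns sum to 1.
P = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]]
# Hypothetical input: 3 tokens with 2 features each.
x0 = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5]]

no_skip = relative_residual(run(P, x0, depth=6, skip=False))
with_skip = relative_residual(run(P, x0, depth=6, skip=True))
```

In this toy, `no_skip` ends up far smaller than `with_skip`, meaning the tokens are much closer to identical when the skip connection is removed.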
Abstract
The self-attention mechanism is central to the success of Transformer architectures. However, standard row-stochastic attention has been shown to suffer from significant signal degradation across layers. In particular, it can induce rank collapse, resulting in increasingly uniform token representations, as well as entropy collapse, characterized by highly concentrated attention distributions. Recent work has highlighted the benefits of doubly stochastic attention as a form of entropy regularization, promoting a more balanced attention distribution and leading to improved empirical performance. In this paper, we study rank collapse across network depth and show that doubly stochastic attention matrices normalized with the Sinkhorn algorithm preserve rank more effectively than standard Softmax row-stochastic ones. As previously shown for Softmax, skip connections are crucial to mitigate rank…
