ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans

Ashkan Shahbazi; Elaheh Akbari; Darian Salehi; Xinran Liu; Navid Naderializadeh; Soheil Kolouri

arXiv:2502.07962 · cs.LG · July 15, 2025

TL;DR

ESPFormer introduces a fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport. Because it needs no iterative normalization, it improves efficiency, and it also improves performance across several deep learning tasks.

Contribution

The paper proposes a novel attention method based on sliced optimal transport that enforces doubly-stochasticity efficiently without iterative Sinkhorn normalization.

Findings

01 Improved attention regularization enhances model performance.

02 Consistent gains across image, text, and point cloud tasks.

03 Efficient, differentiable implementation suitable for deep learning.

Abstract

While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models.…
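To make the core idea concrete, here is a minimal sketch of how sorting along random one-dimensional projections can yield a doubly-stochastic matrix without any Sinkhorn iterations. It is an illustration under stated assumptions, not the authors' implementation: the function name `sliced_plan`, the uniform averaging over slices, and the use of hard sorting are all choices made for this example. The abstract states that the actual method uses a temperature-based soft sort to keep the operation differentiable, and the weighting of slices in Expected Sliced Transport Plans may differ from the plain average used here.

```python
import random


def sliced_plan(queries, keys, num_slices=8, seed=0):
    """Average 1D optimal transport plans over random projection directions.

    queries, keys: lists of n vectors (lists of floats) with equal dimension.
    Returns an n x n matrix whose rows and columns each sum to 1.

    Assumptions (not taken from the paper): slices are averaged uniformly
    and a hard sort is used, so this sketch is not differentiable.
    """
    if len(queries) != len(keys):
        raise ValueError("this sketch assumes equally many queries and keys")
    rng = random.Random(seed)
    n, d = len(queries), len(queries[0])
    plan = [[0.0] * n for _ in range(n)]
    for _ in range(num_slices):
        # Random direction; its scale does not affect the sort order.
        theta = [rng.gauss(0.0, 1.0) for _ in range(d)]
        q_proj = [sum(a * b for a, b in zip(q, theta)) for q in queries]
        k_proj = [sum(a * b for a, b in zip(k, theta)) for k in keys]
        # In 1D, the optimal plan between two uniform n-point sets matches
        # the i-th smallest query to the i-th smallest key (a permutation).
        q_order = sorted(range(n), key=q_proj.__getitem__)
        k_order = sorted(range(n), key=k_proj.__getitem__)
        for i, j in zip(q_order, k_order):
            plan[i][j] += 1.0 / num_slices
    return plan


if __name__ == "__main__":
    Q = [[0.1, 0.9], [0.5, -0.3], [-1.0, 0.2], [0.7, 0.7]]
    K = [[0.3, 0.1], [-0.4, 0.8], [0.9, -0.2], [0.0, 0.5]]
    for row in sliced_plan(Q, K):
        print([round(x, 3) for x in row])
```

Each slice contributes a permutation matrix, and an average of permutation matrices is doubly stochastic by construction, which is why no iterative row and column renormalization is needed. The slices are independent of one another, so they can be computed in parallel.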

Videos

ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans (SlidesLive)

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management