Sparse Sinkhorn Attention
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan

TL;DR
Sparse Sinkhorn Attention is a sparse attention mechanism built on differentiable sorting: it learns to reorder a sequence so that attention over local windows approximates global attention, which reduces the memory cost of the attention module.
Contribution
The paper introduces a meta sorting network that learns latent permutations over sequences, together with two algorithms for adapting the attention to encoding and decoding: Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method.
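To make the differentiable sorting idea concrete, here is a minimal sketch of standard Sinkhorn normalization, which relaxes a score matrix into an approximately doubly stochastic matrix (a soft permutation) by alternating row and column normalization. It is an illustration under assumptions, not the authors' implementation: the `scores` matrix is made up and stands in for whatever the meta sorting network would produce, and the function names and iteration count are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sinkhorn(scores, n_iters=50):
    """Relax a square score matrix into an approximately doubly stochastic one.

    Works in log space: alternately subtract the log-sum-exp of each row and
    of each column, then exponentiate at the end.
    """
    n = len(scores)
    log_p = [row[:] for row in scores]
    for _ in range(n_iters):
        # Row normalization: make every row sum to 1.
        for i in range(n):
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        # Column normalization: make every column sum to 1.
        for j in range(n):
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]

# Hypothetical block-to-block scores (assumption: stand-in for the
# output of a learned sorting network).
scores = [[2.0, 0.1, 0.3],
          [0.2, 0.1, 3.0],
          [0.4, 2.5, 0.2]]
perm = sinkhorn(scores)
```

Each row of `perm` places most of its mass on a single column, so the matrix behaves like a soft permutation while remaining differentiable with respect to `scores`.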
Findings
Outperforms recent efficient Transformer models in multiple tasks
Achieves results competitive with vanilla attention while reducing memory usage
Demonstrates versatility across seq2seq, language modeling, and image generation
Abstract
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with…
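The "quasi-global attention with only local windows" idea in the abstract can be sketched as follows: split the sequence into blocks, reorder the blocks with a permutation matrix, and let each query attend only to its own block plus the block sorted into its position. This is a simplified sketch under assumptions, not the paper's exact formulation: the input `x` is reused as queries, keys, and values with no learned projections, the permutation `perm` is supplied by hand, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def split_blocks(seq, block_size):
    """Cut a list of token vectors into consecutive blocks."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]

def sort_blocks(blocks, perm):
    """Output block i is the perm[i]-weighted sum of the input blocks."""
    n, b, d = len(blocks), len(blocks[0]), len(blocks[0][0])
    return [[[sum(perm[i][j] * blocks[j][t][k] for j in range(n))
              for k in range(d)]
             for t in range(b)]
            for i in range(n)]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[k] for w, v in zip(weights, values))
            for k in range(len(values[0]))]

def sorted_local_attention(x, perm, block_size):
    """Each query sees its own block plus the block sorted into its slot."""
    blocks = split_blocks(x, block_size)
    routed_blocks = sort_blocks(blocks, perm)
    out = []
    for local, routed in zip(blocks, routed_blocks):
        context = local + routed  # 2 * block_size keys, not the full sequence
        for q in local:
            out.append(attend(q, context, context))
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # no sorting: purely local attention
swap = [[0.0, 1.0], [1.0, 0.0]]      # block 0 receives block 1 and vice versa
local_out = sorted_local_attention(x, identity, block_size=2)
sorted_out = sorted_local_attention(x, swap, block_size=2)
```

With the identity permutation a token never sees anything outside its own block. With the swap, tokens in the first block also attend to the second block, even though every query still scores only `2 * block_size` keys.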
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Feedforward Network · SortCut Sinkhorn Attention · Sparse Sinkhorn Attention · Sinkhorn Transformer · Sigmoid Activation · Tanh Activation · Residual Connection
