Sparse Sinkhorn Attention

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan

arXiv:2002.11296 · cs.LG · February 27, 2020 · 77 citations

TL;DR

Sparse Sinkhorn Attention is a sparse attention mechanism built on differentiable sorting: it learns to reorder a sequence so that attention over local windows approximates global attention, reducing memory use across seq2seq, language modeling, and image generation tasks.

Contribution

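The paper presents a differentiable sorting approach in which a meta sorting network learns latent permutations over sequences. It also introduces Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method, for tailoring Sinkhorn Attention to encoding and/or decoding.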

Findings

1. Outperforms recent efficient Transformer models in multiple tasks
2. Achieves competitive results with vanilla attention while reducing memory usage
3. Demonstrates versatility across seq2seq, language modeling, and image generation

Abstract

We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with…
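The abstract does not spell out how a latent permutation is made differentiable. The sketch below is one minimal illustration, assuming the relaxation is the iterated row and column normalization that the method's name suggests (Sinkhorn normalization). The score matrix, the toy block vectors, the iteration count, and the helper names `sinkhorn` and `soft_sort_blocks` are all hypothetical and chosen for illustration; they are not taken from the paper or from any repository.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sinkhorn(scores, n_iters=50):
    """Alternate row and column normalization of exp(scores), in log space.

    Returns a square matrix whose rows and columns each sum to roughly 1,
    i.e. a relaxed ("soft") permutation matrix.
    """
    n = len(scores)
    log_p = [list(row) for row in scores]
    for _ in range(n_iters):
        # Row step: make every row of exp(log_p) sum to 1.
        for i in range(n):
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        # Column step: make every column of exp(log_p) sum to 1.
        for j in range(n):
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]

def soft_sort_blocks(perm, blocks):
    """Mix blocks with the soft permutation: new block i = sum_j perm[i][j] * blocks[j]."""
    dim = len(blocks[0])
    return [
        [sum(perm[i][j] * blocks[j][d] for j in range(len(blocks))) for d in range(dim)]
        for i in range(len(perm))
    ]

# Hypothetical sorting scores for 3 blocks; in the described method a learned
# meta sorting network would produce these, here they are fixed by hand.
scores = [[2.0, 0.5, 0.1],
          [0.3, 1.5, 0.2],
          [0.1, 0.4, 1.8]]
# Toy block representations (one 2-d vector per block), made up for illustration.
blocks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

perm = sinkhorn(scores)
sorted_blocks = soft_sort_blocks(perm, blocks)
row_sums = [sum(row) for row in perm]
col_sums = [sum(perm[i][j] for i in range(3)) for j in range(3)]
```

Because every step is a smooth function of the scores, gradients can flow back to whatever produces them. Under the abstract's description, each position would then attend only within a local window of the reordered sequence; that attention step is not shown here.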

Code & Models

Repositories

lucidrains/sinkhorn-transformer (PyTorch)

Videos

Sparse Sinkhorn Attention · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Feedforward Network · SortCut Sinkhorn Attention · Sparse Sinkhorn Attention · Sinkhorn Transformer · Sigmoid Activation · Tanh Activation · Residual Connection