You Need Better Attention Priors
Elon Litman, Gabe Guo

TL;DR
This paper introduces GOAT, a novel attention mechanism based on Entropic Optimal Transport, which learns trainable priors to improve flexibility, interpretability, and length generalization in attention models.
Contribution
It generalizes attention via Entropic Optimal Transport (EOT), replaces the implicit uniform prior with a learnable one, and integrates spatial information into the attention computation for better length extrapolation.
Findings
GOAT provides a learnable prior that improves attention flexibility.
It offers an EOT-based explanation for attention sinks.
By absorbing spatial information into the attention computation, GOAT learns a prior that extrapolates to longer sequences.
Abstract
We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport, revealing that standard attention corresponds to a transport problem regularized by an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this naive assumption with a learnable, continuous prior. This prior maintains full compatibility with optimized kernels such as FlashAttention. GOAT also provides an EOT-based explanation of attention sinks and materializes a solution for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
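The link between softmax attention and a uniform prior can be checked numerically for a single query. The sketch below is not the authors' implementation: it uses the standard closed form of a KL-regularized problem on the probability simplex, assumes unit regularization strength, and uses a hand-picked fixed prior vector where GOAT would learn a continuous one. The names `attention_row` and `objective` are hypothetical, and the example does not show how the prior is made compatible with FlashAttention.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row(scores, prior=None):
    """Attention weights for one query under a KL-regularized view.

    Maximizes <p, scores> - KL(p || prior) over the probability simplex.
    The maximizer is p_j proportional to prior_j * exp(scores_j),
    i.e. softmax(scores + log prior).
    """
    n = len(scores)
    if prior is None:
        prior = [1.0 / n] * n  # uniform prior (assumed default)
    logits = [s + math.log(q) for s, q in zip(scores, prior)]
    return softmax(logits)

def objective(p, scores, prior):
    # Value of <p, scores> - KL(p || prior) for a candidate distribution p.
    reward = sum(pi * si for pi, si in zip(p, scores))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, prior) if pi > 0)
    return reward - kl

scores = [2.0, 0.5, -1.0, 0.0]   # illustrative query-key scores
plain = softmax(scores)           # standard attention weights
uniform = attention_row(scores)   # same problem with an explicit uniform prior

prior = [0.7, 0.1, 0.1, 0.1]      # illustrative non-uniform prior over keys
biased = attention_row(scores, prior)
```

With a uniform prior, the constant `log(1/n)` term cancels inside the softmax, so `uniform` matches `plain`. A non-uniform prior enters only as an additive per-key bias on the logits, which here shifts weight toward the first key.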
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
