You Need Better Attention Priors

Elon Litman; Gabe Guo

arXiv:2601.15380·cs.LG·January 23, 2026

Open Access

TL;DR

This paper introduces GOAT, an attention mechanism derived from Entropic Optimal Transport (EOT) that replaces the implicit uniform prior of standard attention with a trainable one, aiming to improve flexibility, interpretability, and length generalization.

Contribution

The paper generalizes attention through Entropic Optimal Transport, replaces the implicit uniform prior of standard attention with a learnable, continuous prior, and folds spatial information into the attention computation to improve length extrapolation.
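
The role of the prior can be made concrete with a one-query sketch. The notation here is an assumption for illustration and is not taken from the paper: s_j is the scaled dot-product score of key j, \tau is a temperature, \pi is a prior over the n keys, and \Delta^{n-1} is the probability simplex. The closed form is the standard solution of a KL-regularized linear objective; the paper's own formulation may differ in detail.

```latex
% Assumed notation: s_j = q^\top k_j / \sqrt{d}, prior \pi over keys, temperature \tau.
\begin{aligned}
p^\star &= \arg\max_{p \in \Delta^{n-1}} \; \sum_{j} p_j s_j \;-\; \tau\,\mathrm{KL}(p \,\|\, \pi), \\
p^\star_j &= \frac{\pi_j \exp(s_j/\tau)}{\sum_{k} \pi_k \exp(s_k/\tau)}
          = \mathrm{softmax}\!\left(\frac{s}{\tau} + \log \pi\right)_j .
\end{aligned}
```

When \pi is uniform, \log \pi is the same constant for every key and cancels in the normalization, which recovers the usual \mathrm{softmax}(s/\tau).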

Findings

01

GOAT replaces the implicit uniform prior of standard attention with a learnable, continuous prior.

02

GOAT offers an EOT-based explanation of attention sinks and a way to realize them without the representational trade-offs of standard attention.

03

By absorbing spatial information into the core attention computation, GOAT learns a prior that extrapolates to longer sequences.

Abstract

We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport, revealing that standard attention corresponds to a transport problem regularized by an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this naive assumption with a learnable, continuous prior. This prior maintains full compatibility with optimized kernels such as FlashAttention. GOAT also provides an EOT-based explanation of attention sinks and materializes a solution for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
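
A small numerical sketch may help illustrate the "implicit uniform prior" reading of the abstract. It assumes that a prior over keys enters the softmax as an additive log-bias on the attention logits, which is the standard closed form for KL-regularized attention weights; the function `attention_with_prior` and the hand-picked prior are hypothetical and are not GOAT's actual parameterization, which is not shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prior(q, k, v, log_prior=None):
    """Scaled dot-product attention with an optional additive log-prior.

    Hypothetical helper for illustration only: log_prior must broadcast
    against the (num_queries, num_keys) score matrix.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if log_prior is not None:
        scores = scores + log_prior
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 5, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

# A uniform prior adds the same constant to every logit, so it cancels.
uniform_log_prior = np.full((n, n), np.log(1.0 / n))
_, w_std = attention_with_prior(q, k, v)
_, w_uni = attention_with_prior(q, k, v, uniform_log_prior)

# A non-uniform prior (assumed here: half the mass on key 0) reweights attention.
prior = np.full(n, 0.5 / (n - 1))
prior[0] = 0.5
_, w_pri = attention_with_prior(q, k, v, np.log(prior))

# Closed form for comparison: weights proportional to prior * exp(score).
direct = prior * np.exp(q @ k.T / np.sqrt(d))
direct /= direct.sum(axis=-1, keepdims=True)

print(np.allclose(w_std, w_uni), np.allclose(w_pri, direct))
```

Because the prior shows up only as a bias added to the logits before the softmax, the attention computation itself is unchanged in this sketch, which is consistent with the abstract's claim of kernel compatibility, though the paper's mechanism for that is not reproduced here.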

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis