On the Expressive Power of Contextual Relations in Transformers
Demián Fraiman

TL;DR
This paper introduces a measure-theoretic framework for understanding the expressive power of Transformers in modeling contextual relations, linking attention mechanisms to optimal transport and proving their universal approximation capabilities.
Contribution
It establishes a unified probabilistic view of attention, connects softmax attention to entropy-regularized optimal transport, and proves Transformers can universally approximate contextual relations.
Findings
Transformers can approximate arbitrary contextual relations.
Softmax attention is connected to entropy-regularized optimal transport.
The choice of normalization (softmax or Sinkhorn) influences how relations are represented.
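The link between softmax and entropy regularization in the findings above can be illustrated with a standard variational identity. This is a sketch of the well-known textbook fact, not necessarily the formulation used in the paper; the symbols s_i, ε, and Δ^{n-1} are assumed notation introduced here for illustration.

```latex
% Assumed notation (not from the paper): s_i \in \mathbb{R}^n holds the affinity
% scores of query i, \varepsilon > 0 is a temperature, and \Delta^{n-1} is the
% probability simplex over the n keys.
\[
  \operatorname{softmax}(s_i/\varepsilon)
  = \operatorname*{arg\,max}_{p \in \Delta^{n-1}}
    \; \langle p, s_i \rangle + \varepsilon H(p),
  \qquad
  H(p) = -\sum_{j=1}^{n} p_j \log p_j .
\]
% Stationarity of the Lagrangian gives
%   s_{ij} - \varepsilon (\log p_j + 1) - \lambda = 0,
% hence p_j \propto \exp(s_{ij}/\varepsilon).
```

Each attention row is thus the solution of an entropy-regularized problem with a single marginal constraint (the row sums to one). Imposing a second, column-marginal constraint turns this into an entropic optimal transport problem over couplings, which is the setting where Sinkhorn iterations apply.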
Abstract
Transformer architectures have achieved remarkable empirical success in modeling contextual relations, yet a clear understanding of their expressive power is still lacking. In this work, we introduce a measure-theoretic framework in which contextual relations are modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings). This perspective reveals a natural connection between standard softmax attention and entropy-regularized optimal transport, providing a unified view of attention as a normalization of an underlying affinity function. Within this framework, we establish a universal approximation theorem for contextual systems using standard softmax attention and, alternatively, Sinkhorn normalization. These results show that Transformer architectures can approximate arbitrary contextual relations, and that the choice of normalization…
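The view of attention as a normalization of an underlying affinity function can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `softmax_attention` and `sinkhorn_attention`, the temperature `eps`, and the uniform marginals are all hypothetical choices made here. Row normalization yields one conditional distribution per query, while alternating row and column normalization (Sinkhorn) yields an approximately doubly stochastic matrix, which is a coupling up to a constant factor.

```python
import numpy as np


def softmax_attention(scores, eps=1.0):
    """Row-normalize exp(scores / eps); each row is a conditional distribution."""
    z = scores / eps
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    k = np.exp(z)
    return k / k.sum(axis=1, keepdims=True)


def sinkhorn_attention(scores, eps=1.0, n_iters=100):
    """Alternate row and column normalization of exp(scores / eps).

    Assumes a square score matrix and uniform marginals (an illustrative
    choice), so the result is approximately doubly stochastic.
    """
    k = np.exp((scores - scores.max()) / eps)  # global shift does not change the result
    for _ in range(n_iters):
        k = k / k.sum(axis=1, keepdims=True)  # rows sum to 1
        k = k / k.sum(axis=0, keepdims=True)  # columns sum to 1
    return k


# Hypothetical 3x3 affinity matrix (queries x keys), for illustration only.
scores = np.array([[1.0, 0.0, 0.5],
                   [0.2, 0.8, 0.1],
                   [0.0, 0.3, 0.9]])

P = softmax_attention(scores)   # rows sum to 1, columns generally do not
Q = sinkhorn_attention(scores)  # rows and columns both sum to (approximately) 1
print(P.sum(axis=1), P.sum(axis=0))
print(Q.sum(axis=1), Q.sum(axis=0))
```

Both functions start from the same kernel `exp(scores / eps)` and differ only in which marginal constraints the normalization enforces, which is the sense in which the normalization determines the kind of probabilistic object produced.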
