On the Geometry of Positional Encodings in Transformers
Giansalvo Cirrincione

TL;DR
This paper develops a mathematical theory of positional encodings in Transformers, establishing their necessity, properties, optimal construction, and parameter efficiency, supported by experiments on language tasks.
Contribution
It provides the first rigorous theoretical framework for understanding and designing positional encodings in Transformers, including optimal encoding construction and minimal parametrization.
Findings
Transformers without positional signals cannot solve order-sensitive tasks.
Under mild, verifiable conditions, every global minimiser of training assigns distinct vectors to distinct sequence positions.
Attention with Linear Biases (ALiBi) achieves lower stress than sinusoidal encodings.
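The first finding rests on a standard fact: self-attention with no positional signal is permutation-equivariant, so reordering the input tokens only reorders the output rows. The following is a minimal NumPy sketch of that fact, not the paper's proof. The single-head layer, the random weights, and the names `self_attention`, `equivariant`, and `pooled_equal` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product attention, no positional signal.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 5, 8                      # hypothetical sequence length and width
X = rng.normal(size=(n, d))      # token embeddings, one row per token
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n)        # an arbitrary reordering of the tokens

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the input only shuffles the output rows in the same way.
equivariant = np.allclose(out[perm], out_perm)
# Any order-independent readout (here, mean pooling) cannot see the shuffle.
pooled_equal = np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
print(equivariant, pooled_equal)
```

Because the pooled output is identical for every ordering, a model built only from such layers returns the same answer for a sentence and any shuffle of it, which is why an order-sensitive task is out of reach without a positional signal.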
Abstract
Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between…
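The third result names classical MDS on a Hellinger distance. The abstract is cut off before it says which distributions are compared, so the sketch here uses random Dirichlet rows purely as a stand-in, one hypothetical distribution per position. It illustrates the generic technique (Hellinger distance matrix, double centering, eigendecomposition) and is not the paper's construction. The function names are illustrative.

```python
import numpy as np

def hellinger_matrix(P):
    # P has shape (n, k); each row is a probability distribution.
    # H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2)
    R = np.sqrt(P)
    diff = R[:, None, :] - R[None, :, :]
    return np.sqrt(0.5 * (diff ** 2).sum(axis=-1))

def classical_mds(D, dim):
    # Classical (Torgerson) MDS: double-centre the squared distances,
    # then embed with the top eigenvectors of the resulting Gram matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:dim]
    lam = np.clip(evals[idx], 0.0, None)   # guard against round-off negatives
    return evecs[:, idx] * np.sqrt(lam)

rng = np.random.default_rng(0)
n_pos, k = 6, 10                           # hypothetical sizes
P = rng.dirichlet(np.ones(k), size=n_pos)  # stand-in distributions (assumption)
D = hellinger_matrix(P)

E_full = classical_mds(D, dim=n_pos - 1)   # enough dimensions for an exact fit
E_2d = classical_mds(D, dim=2)             # low-dimensional approximation

# Pairwise Euclidean distances between the embedded positions.
D_hat = np.linalg.norm(E_full[:, None, :] - E_full[None, :, :], axis=-1)
print(np.allclose(D_hat, D, atol=1e-6))
```

The Hellinger distance is a scaled Euclidean distance between square-root vectors, so with `n_pos - 1` dimensions the embedding reproduces the distance matrix up to round-off. Smaller `dim` values trade that fidelity for a more compact encoding.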
