Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph

Kazuki Irie

arXiv:2501.00659·cs.LG·June 3, 2025

TL;DR

Autoregressive Transformers with more than one layer do not require explicit positional encodings to distinguish token order: a stack of causally masked attention layers is already sensitive to sequence order. This has long been known but remains underappreciated.

Contribution

This paper revisits and clarifies the longstanding but underrecognized fact that multi-layer autoregressive Transformers do not need explicit positional encodings, providing historical context and explanation.

Findings

01 Multi-layer autoregressive Transformers can distinguish sequence order without explicit PEs.

02 One-layer models require positional encodings to identify token order.

03 The property has been known historically but is not widely disseminated today.
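
The first two findings can be checked numerically with a small sketch. This is not the paper's code: it is a minimal illustration assuming a single attention head, a residual connection, random untrained weights, and no layer normalization or feed-forward sublayer. The names `causal_attention` and `last_output` are hypothetical helpers introduced here. The sketch feeds two sequences that share the same final token and the same set of prefix tokens in a different order, then compares the outputs at the last position.

```python
import numpy as np

def causal_attention(x, wq, wk, wv):
    """Single-head causal self-attention with a residual connection.
    No positional encoding is added anywhere (illustrative assumption)."""
    t, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions: entry (i, j) with j > i is not visible to i.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v

def last_output(tokens, embed, layers):
    """Run the token embeddings through the layer stack; return the last position."""
    x = embed[tokens]
    for wq, wk, wv in layers:
        x = causal_attention(x, wq, wk, wv)
    return x[-1]

rng = np.random.default_rng(0)
vocab, d = 10, 16
embed = rng.normal(size=(vocab, d))

def make_layer():
    # Random (untrained) query, key, and value projections.
    return tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

layer1, layer2 = make_layer(), make_layer()

seq_a = np.array([1, 2, 3, 4])
seq_b = np.array([3, 1, 2, 4])  # prefix permuted, final token unchanged

one_layer_gap = float(np.max(np.abs(
    last_output(seq_a, embed, [layer1]) - last_output(seq_b, embed, [layer1]))))
two_layer_gap = float(np.max(np.abs(
    last_output(seq_a, embed, [layer1, layer2]) - last_output(seq_b, embed, [layer1, layer2]))))

print("one layer :", one_layer_gap)   # zero up to floating-point rounding
print("two layers:", two_layer_gap)   # clearly nonzero
```

With one layer, the last position attends over the same set of keys and values in both sequences, so its output cannot depend on the prefix order. With two layers, each first-layer output already summarizes a different prefix because of the causal mask, so the second layer sees different inputs and the two sequences become distinguishable.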

Abstract

Do autoregressive Transformer language models require explicit positional encodings (PEs)? The answer is 'no' provided they have more than one layer -- they can distinguish sequences with permuted tokens without the need for explicit PEs. This follows from the fact that a cascade of (permutation invariant) set processors can collectively exhibit sequence-sensitive behavior in the autoregressive setting. This property has been known since early efforts (contemporary with GPT-2) adopting the Transformer for language modeling. However, this result does not appear to have been well disseminated, leading to recent rediscoveries. This may be partially due to a sudden growth of the language modeling community after the advent of GPT-2/3, but perhaps also due to the lack of a clear explanation in prior work, despite being commonly understood by practitioners in the past. Here we review the…

Taxonomy

Topics: Image Processing and 3D Reconstruction · 3D Surveying and Cultural Heritage · Archaeological and Geological Studies

Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Cosine Annealing · Adam · Residual Connection