Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings

Chunsheng Zuo; Pavel Guerzhoy; Michael Guerzhoy

arXiv:2501.00073 · cs.CL · January 3, 2025

TL;DR

This paper proposes that causal transformers can encode positional information without explicit positional encodings because nearby embeddings are more similar to each other than distant ones, and shows that this pattern can occur in both trained and randomly initialized models.

Contribution

It proposes and investigates a new hypothesis: positional information can be stored in the similarity pattern of embeddings, so a causal transformer does not have to rely on explicit positional encodings.

Findings

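01  Nearby embeddings are more similar to each other than faraway embeddings.

02  This similarity pattern potentially allows the transformer to reconstruct the positions of tokens.

03  The pattern can occur in both trained and randomly initialized causal-attention models without positional encodings, over a common range of hyperparameters.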

Abstract

Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.
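
The similarity pattern described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code or experimental setup: the single-head layer, the function names `causal_attention_no_pe` and `mean_cosine_by_offset`, the 0.02 initialization scale, and the sequence and model sizes are all assumptions chosen for illustration. The sketch passes random token embeddings through one randomly initialized causal attention layer with no positional encoding, then compares the average cosine similarity of outputs one position apart with that of outputs 32 positions apart.

```python
import numpy as np


def causal_attention_no_pe(x, w_q, w_k, w_v):
    """Single-head causal self-attention with no positional encoding."""
    seq_len, d_model = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(d_model)
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def mean_cosine_by_offset(h, offset):
    """Average cosine similarity between outputs `offset` positions apart."""
    unit = h / np.linalg.norm(h, axis=-1, keepdims=True)
    sim = unit @ unit.T
    return float(np.mean(np.diagonal(sim, offset=offset)))


rng = np.random.default_rng(0)
seq_len, d_model = 64, 64  # illustrative sizes, not from the paper

# Random "token embeddings" with no position signal added.
x = rng.standard_normal((seq_len, d_model))
# Small random weights (assumed init scale), i.e. an untrained layer.
w_q, w_k, w_v = (0.02 * rng.standard_normal((d_model, d_model)) for _ in range(3))

h = causal_attention_no_pe(x, w_q, w_k, w_v)
near = mean_cosine_by_offset(h, 1)
far = mean_cosine_by_offset(h, 32)
print(f"mean cosine similarity at offset 1:  {near:.3f}")
print(f"mean cosine similarity at offset 32: {far:.3f}")
```

In this toy setting each output is a weighted average over the prefix of value vectors, so two nearby positions average over almost the same set of tokens while two distant positions share a smaller fraction of it. That overlap is one plausible reason the offset-1 similarity comes out higher than the offset-32 similarity here; the paper's own analysis should be consulted for the actual argument and the hyperparameter ranges studied.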

Taxonomy

Topics: Blind Source Separation Techniques · Neural Networks and Applications

Methods: Attention Is All You Need · Byte Pair Encoding · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Adam · Residual Connection · Multi-Head Attention