Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
Chunsheng Zuo, Pavel Guerzhoy, Michael Guerzhoy

TL;DR
This paper demonstrates that causal transformers can inherently encode positional information through the similarity of nearby embeddings, even without explicit positional encodings, in both trained and untrained models.
Contribution
It introduces a new hypothesis that positional information emerges from embedding similarity patterns, challenging the need for explicit positional encodings in transformers.
Findings
Nearby embeddings are more similar than distant ones.
Positional information can be reconstructed from embedding similarities.
This pattern occurs in both trained and randomly initialized models.
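One way to see why nearby embeddings could end up more similar is a back-of-the-envelope calculation. It is an illustrative sketch, not the paper's derivation. It rests on two assumptions that the summary above does not state: causal attention is exactly uniform over the visible prefix (roughly the case when query and key weights are near zero), and the value vectors v_k are i.i.d., zero-mean, and isotropic with variance sigma^2 per coordinate in dimension d.

```latex
% Assumed: uniform causal attention, i.i.d. isotropic value vectors v_k in R^d.
h_i = \frac{1}{i}\sum_{k=1}^{i} v_k ,
\qquad
\mathbb{E}\,\langle h_i, h_j\rangle
  = \frac{1}{ij}\sum_{k=1}^{i}\mathbb{E}\,\lVert v_k\rVert^2
  = \frac{\sigma^2 d}{j}
  \quad (i \le j),
\qquad
\mathbb{E}\,\lVert h_i\rVert^2 = \frac{\sigma^2 d}{i} .

% For large d the norms concentrate, so approximately
\cos(h_i, h_j) \approx
  \frac{\sigma^2 d / j}{\sqrt{(\sigma^2 d / i)\,(\sigma^2 d / j)}}
  = \sqrt{\frac{i}{j}}
  \quad (i \le j).
```

Under these assumptions the similarity depends only on the ratio i/j: it is close to 1 for adjacent positions and decays as the positions move apart, which is the qualitative pattern listed in the findings.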
Abstract
Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.
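The pattern described in the abstract can be probed with a small NumPy sketch. It is not the authors' code or experimental setup: the single attention layer, the random Gaussian inputs standing in for token embeddings, the 0.02 initialization scale, and the helper names `causal_self_attention`, `cosine_similarity_matrix`, and `mean_similarity_at_offset` are all assumptions made for illustration.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention with no positional encodings."""
    n, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(d)
    # Mask out future positions: token i may only attend to tokens 0..i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

def cosine_similarity_matrix(h):
    """Pairwise cosine similarity between the rows of h."""
    unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    return unit @ unit.T

def mean_similarity_at_offset(sim, offset):
    """Average similarity between positions i and i + offset."""
    return float(np.mean(np.diagonal(sim, offset=offset)))

# Hypothetical setup: 64 random "token embeddings" of width 64 and
# randomly initialized (untrained) projection weights.
rng = np.random.default_rng(0)
n, d = 64, 64
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.normal(0.0, 0.02, (d, d)) for _ in range(3))

h = causal_self_attention(x, w_q, w_k, w_v)
sim = cosine_similarity_matrix(h)

near = mean_similarity_at_offset(sim, 1)       # adjacent positions
far = mean_similarity_at_offset(sim, n // 2)   # positions 32 apart
print(f"mean cosine similarity, offset 1:  {near:.3f}")
print(f"mean cosine similarity, offset {n // 2}: {far:.3f}")
```

With a small initialization scale the attention weights are close to uniform over the visible prefix, so each output is roughly a running average of the value vectors. Adjacent outputs then share almost all of their terms, while distant outputs share fewer. Whether the same holds for deeper, trained models is the question the paper investigates. This sketch only illustrates the mechanism in the simplest untrained case.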
Taxonomy
Topics: Blind Source Separation Techniques · Neural Networks and Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Adam · Residual Connection · Multi-Head Attention
