RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng

TL;DR
This paper proves that Rotary Positional Embeddings (RoPE) have fundamental limitations in long-context Transformers, failing to reliably distinguish positions or tokens as context length grows, and suggests new mechanisms are needed.
Contribution
The paper provides a theoretical analysis of RoPE's intrinsic limitations in long contexts and shows that increasing the RoPE base cannot make attention distinguish both positions and tokens at the same time.
Findings
RoPE loses its locality bias in long contexts: nearer positions are no more likely to be favored than much farther ones.
Attention scores can remain unchanged when a key token is moved to a different position.
Multi-head, multi-layer architectures cannot overcome RoPE limitations.
Abstract
We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced…
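To make the quantity in the abstract concrete, here is a minimal sketch of a RoPE attention score as a function of the distance between query and key. It assumes the standard RoPE formulation (each pair of coordinates rotated by an angle of position times base^(-2i/d)); the paper's exact setup may differ. The names `rope_rotate` and `rope_score`, the dimension 64, the base 10000, and the all-ones vectors are illustrative choices, not taken from the paper. The sketch does not reproduce the paper's proofs or its 0.5 probability bounds; it only shows that the score is a sum of sinusoids in the relative distance.

```python
import math


def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to an even-length vector x at integer position pos.

    Assumes the standard pairing (x[0], x[1]), (x[2], x[3]), ... with
    frequency base ** (-2 * i / d) for the i-th pair.
    """
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]
        # 2D rotation of the i-th coordinate pair
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def rope_score(q, q_pos, k, k_pos, base=10000.0):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    rq = rope_rotate(q, q_pos, base)
    rk = rope_rotate(k, k_pos, base)
    return sum(a * b for a, b in zip(rq, rk))


if __name__ == "__main__":
    d = 64                      # illustrative head dimension
    q = [1.0] * d               # illustrative query
    k = [1.0] * d               # illustrative key, identical to the query
    q_pos = 100_000             # query sits late in a long context
    # Print the score at increasing distances between query and key.
    for dist in (0, 1, 8, 64, 512, 4096, 32768):
        print(dist, round(rope_score(q, q_pos, k, q_pos - dist), 3))
```

Because rotations compose, the score depends only on the offset between the two positions, not on the positions themselves. Comparing the printed values at small and large distances gives a feel for how the score behaves as the key moves away from the query for this one hand-picked query/key pair.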
