Round and Round We Go! What makes Rotary Positional Encodings useful?

Federico Barbero; Alex Vitvitskyi; Christos Perivolaropoulos; Razvan Pascanu; Petar Veličković

arXiv:2410.06205·cs.CL·May 14, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how Rotary Positional Encodings (RoPE) are used inside a trained LLM, finding that they help construct attention patterns and carry semantic information, and proposes a modification to improve their effectiveness.

Contribution

It provides a detailed analysis of RoPE's role in LLMs, challenges common assumptions, and introduces a modified RoPE to enhance performance.

Findings

01

Gemma 7B learns to use the highest frequencies of RoPE to construct robust "positional" attention patterns.

02

Gemma prefers low frequencies of RoPE for semantic information.

03

Proposed RoPE modification improves model performance.

Abstract

Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs is Rotary Positional Encodings (RoPE), which rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry…
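The rotation described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's or Gemma's implementation: the function names (`rope_frequencies`, `apply_rope`, `dot`) are hypothetical, the base of 10000 and the pairing of adjacent dimensions follow the commonly used RoPE convention and are assumptions here, and the vectors are made-up toy values. The sketch shows that the query-key score depends only on the relative offset between positions, and lists the per-pair frequencies from highest to lowest.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies, highest first: base ** (-2k / dim).

    Assumes the common RoPE convention with base 10000 (an assumption,
    not taken from the paper's text above).
    """
    return [base ** (-2.0 * k / dim) for k in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[2k], vec[2k+1]) by position * freq_k."""
    out = []
    for k, freq in enumerate(rope_frequencies(len(vec), base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for the pre-softmax attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (illustrative values only).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_near = dot(apply_rope(query, 5), apply_rope(key, 2))
score_far = dot(apply_rope(query, 105), apply_rope(key, 102))

# The two scores agree up to floating-point error.
print(abs(score_near - score_far) < 1e-9)
```

In this layout the first pair rotates fastest (frequency 1) and the last pair rotates slowest, which is the sense in which the abstract speaks of the "highest" and "lowest" frequencies.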

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. This paper successfully challenges, both theoretically and empirically, the traditional view that RoPE attenuates attention weights as relative distance between tokens increases.
2. The authors' hypothesis about the roles of the high-frequency and low-frequency components of RoPE is novel and insightful.

Weaknesses

1. The experimental validation lacks diversity in both foundation models and datasets.
2. Some perspectives and proofs regarding NoPE contain errors; while they don't affect the main conclusion, they may mislead readers.
3. The discussion of related work is insufficiently thorough, and few papers are cited (only about one page).
4. Throughout Section 4, the authors conceal a core assumption: that attention scores are interpretable or meaningful. Higher attention scores for certain tokens imply a

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. The paper provides a fresh perspective on RoPE, questioning existing assumptions and offering new explanations for its effectiveness.
2. The authors present mathematical proofs to support their claims, enhancing the credibility of their findings.
3. The use of the Gemma 7B model for empirical analysis adds practical relevance to the theoretical insights.

Weaknesses

1. Although the observed phenomena and mathematical proofs support the paper's point of view, the experimental performance does not seem good enough. The paper hopes to adapt to any context length, but the actual experiments report only one result, at 8K. The evaluation of PPL is also not comprehensive enough.
2. At the semantic level, the results of the models in Table 2 should be compared on the general benchmark or other tasks that are more representative of semantics, which will be

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The authors conduct a novel theoretical and empirical study of RoPE encodings in transformer models.
- They provide detailed proofs of their main claims.
- The paper is clear and well-written.
- This study can help researchers better understand the underlying mechanisms of popular transformer architectures and encourage research into alternative improved solutions.

Weaknesses

- The empirical study is limited to a single Gemma architecture. While this is unlikely, some results may be artifacts of the specific model selected.
- In Section 3, the authors train a 2B model and show improvements in validation perplexity. While these results are positive, perplexity improvements do not always result in overall improvements in a model's abilities. The authors could provide evaluation results on popular benchmarks* to build a more convincing picture. * see, e.g., Section 2.3 in https:/

Taxonomy

Topics: Natural Language Processing Techniques · Constraint Satisfaction and Optimization · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need