Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu

TL;DR
This paper introduces an extension to Rotary Position Embeddings that re-incorporates the imaginary component of the complex-valued attention scores, significantly improving long-context modeling in Large Language Models.
Contribution
It proposes a novel method that leverages the full complex-valued representation in RoPE, enhancing long-context dependency modeling in LLMs.
Findings
Improved performance on long-context benchmarks
Enhanced modeling of positional information
Benefits increase with longer contexts
Abstract
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations…
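To make the abstract's description concrete, the following is a minimal NumPy sketch, not the authors' implementation. It treats each pair of dimensions as one complex number, rotates it by a position-dependent angle, and forms the complex inner product of a query at position m with a key at position n. The real part of that product equals the usual RoPE attention score; the imaginary part is the quantity the abstract says is normally discarded. The function names, the base of 10000, and the adjacent-pair layout are assumptions chosen for illustration, and the way the paper combines the two components into its dual-component score is not reproduced here.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """View an even-length real vector as d/2 complex numbers and rotate each
    by the angle pos * theta_j (hypothetical helper, adjacent-pair layout)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    z = x[0::2] + 1j * x[1::2]                  # dims (0,1), (2,3), ... as complex
    return z * np.exp(1j * pos * theta)

def complex_score(q, k, m, n):
    """Complex inner product of the rotated query (position m) and the
    rotated key (position n). Standard RoPE keeps only the real part."""
    return np.sum(rope_rotate(q, m) * np.conj(rope_rotate(k, n)))

def real_rope_score(q, k, m, n, base=10000.0):
    """Reference: the same score computed with explicit real 2x2 rotations."""
    def rot(x, pos):
        d = x.shape[-1]
        theta = base ** (-np.arange(0, d, 2) / d)
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        out = np.empty_like(x)
        out[0::2] = x[0::2] * c - x[1::2] * s
        out[1::2] = x[0::2] * s + x[1::2] * c
        return out
    return float(rot(q, m) @ rot(k, n))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s = complex_score(q, k, m=7, n=3)
print("real part (standard RoPE score):", s.real)
print("imaginary part (discarded):     ", s.imag)
```

Both parts depend only on the relative offset m - n, so shifting both positions by the same amount leaves the complex score unchanged.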
Peer Reviews
Decision: ICLR 2026 Poster
1. Novel perspective: Identifying and addressing the discarded imaginary component in RoPE is creative and theoretically motivated. The observation that imaginary attention captures longer-range dependencies is interesting.
2. Theoretical justification: The paper provides mathematical grounding (Equations 2-5) showing that imaginary attention follows a sine integral characteristic curve, complementing the cosine integral of real attention.
3. Generalization: Method generalizes to diffusion/bidir
1. Limited scale: Experiments only go to 700M parameters, which is quite small by modern LLM standards. It's unclear if benefits hold at 7B+ scale where most practical long-context work happens.
2. Modest improvements: Performance gains are often marginal. In Table 2, RoPE++_EC only outperforms RoPE by ~1-2 points on average.
3. No plug-and-play extrapolation: The authors acknowledge (Section 5.3, Limitation) that RoPE++ doesn't provide direct length extrapolation like other methods, limiting pr
1. The idea is novel. The authors made the good observation that RoPE only captures the real part when interpreted as complex multiplication.
2. Good theoretical motivations and explanations.
3. The evaluation uses a good set of benchmarks and looks convincing given the training horizon (50B tokens) and scale, though I'm skeptical about how well the gains will hold if we train for longer, on a more reasonable number of tokens.
1. The experimental setup doesn't seem to have enough scale to convincingly show the gain in model pre-training. 50B tokens is unfortunately sometimes too small to have confidence about certain pre-training signals, though I understand that there is typically a budget issue in academia.
2. I'm skeptical about whether the setup is bug-free and uses reasonable hyperparameters, as the training for ALiBi and NoPE seems to have required compromises.
- Proposes a new method for positional encoding that shows promise, especially for long context
- Performs extensive experiments on different datasets
- Includes several positional encoding methods as baselines

- It would be great to have baselines for RoPE methods that target long context
- The theoretical justification of the method could be improved
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
