MrRoPE: Mixed-radix Rotary Position Embedding

Qingyuan Tian; Wenhong Zhu; Xiaoran Liu; Xiaofeng Wang; Rui Wang

arXiv:2601.22181 · cs.CL · February 2, 2026

Open Access · 3 Reviews

TL;DR

MrRoPE introduces a unified, theory-based approach to extend Rotary Position Embedding for longer sequences, enabling high-performance, training-free generalization in large-context tasks.

Contribution

It proposes a generalized radix conversion framework for RoPE extensions and introduces two training-free methods, MrRoPE-Uni and MrRoPE-Pro, for effective long-sequence encoding.

Findings

01 MrRoPE-Pro achieves over 85% recall on 128K-context tasks.

02 MrRoPE-Pro more than doubles YaRN's accuracy on Infinite-Bench retrieval.

03 Theoretical analysis confirms increased upper bounds for encoding length.

Abstract

Rotary Position Embedding (RoPE)-extension refers to modifying or generalizing the Rotary Position Embedding scheme to handle longer sequences than those encountered during pre-training. However, current extension strategies are highly diverse and lack a unified theoretical foundation. In this paper, we propose MrRoPE (Mixed-radix RoPE), a generalized encoding formulation based on a radix system conversion perspective, which elegantly unifies various RoPE-extension approaches as distinct radix conversion strategies. Based on this theory, we introduce two training-free extensions, MrRoPE-Uni and MrRoPE-Pro, which leverage uniform and progressive radix conversion strategies, respectively, to achieve 'train short, test long' generalization. Without fine-tuning, MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN's accuracy on…
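To make the radix perspective concrete, here is a minimal sketch. It is not the paper's MrRoPE-Uni or MrRoPE-Pro. It only shows standard RoPE frequencies and the well-known NTK-aware base change, read as a radix conversion. The function names (`rope_inv_freq`, `ntk_scaled_base`, `radix`) and the values of `dim`, `base`, and `scale` are illustrative assumptions. In standard RoPE the wavelengths of consecutive dimension pairs grow by a constant factor `base ** (2 / dim)`, which plays the role of a radix. Changing the base multiplies every radix by the same factor.

```python
import math

def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/dim), i = 0..dim/2-1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def ntk_scaled_base(base, scale, dim):
    """NTK-aware base change: base' = base * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))

def radix(dim, base):
    """Ratio between wavelengths of consecutive dimension pairs (the 'radix')."""
    return base ** (2.0 / dim)

# Illustrative values (assumptions, not taken from the paper).
dim, base, scale = 128, 10000.0, 4.0

new_base = ntk_scaled_base(base, scale, dim)
old = rope_inv_freq(dim, base)
new = rope_inv_freq(dim, new_base)

# The fastest-rotating pair is unchanged; the slowest is slowed by `scale`.
ratio_first = old[0] / new[0]    # 1, since base^0 == 1 for any base
ratio_last = old[-1] / new[-1]   # equals `scale` up to floating-point error

# Every radix is enlarged by the same factor scale^(2 / (dim - 2)).
radix_growth = radix(dim, new_base) / radix(dim, base)
```

Because the same growth factor applies to all `dim / 2 - 1` steps between adjacent dimension pairs, the factors compound to exactly `scale` at the lowest frequency. A non-uniform strategy would distribute that total stretch unevenly across dimensions instead.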

Peer Reviews

Decision: ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The work addresses the critical problem of training-free context extension for LLMs, and its unifying framework could have a profound impact on future research. The connection between RoPE extension and radix conversion theory is a highly novel and insightful perspective, elevating heuristic designs to a theoretical level.

Weaknesses

1. The characterization of YaRN as "regressive" is based on observation rather than rigorous derivation. The paper fails to mathematically prove this from YaRN's original equations.
2. The method heavily relies on hyperparameters (`α`, `β`) inherited from YaRN, yet lacks a sensitivity analysis or ablation study to verify their optimality for the new method.
3. The comparison is limited to training-free RoPE methods, failing to adequately position the work relative to fine-tuning-based approaches
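For context on the `α` and `β` hyperparameters mentioned above, the following is a minimal sketch of how they enter YaRN's published frequency interpolation. It illustrates YaRN only, not MrRoPE. The function name `yarn_inv_freq` and the values of `dim`, `scale`, and `orig_ctx` are illustrative assumptions. The defaults `alpha=1` and `beta=32` are the values the YaRN paper recommends for LLaMA-family models. YaRN's attention temperature scaling is omitted.

```python
import math

def yarn_inv_freq(dim, base=10000.0, scale=4.0, orig_ctx=4096,
                  alpha=1.0, beta=32.0):
    """YaRN-style blend of interpolated and original RoPE frequencies."""
    out = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)      # original RoPE frequency
        wavelength = 2.0 * math.pi / theta
        r = orig_ctx / wavelength             # rotations within the original context
        if r < alpha:
            gamma = 0.0                       # low frequency: fully interpolate
        elif r > beta:
            gamma = 1.0                       # high frequency: leave unchanged
        else:
            gamma = (r - alpha) / (beta - alpha)  # linear ramp in between
        out.append((1.0 - gamma) * theta / scale + gamma * theta)
    return out

# Illustrative call; dim=128 is an assumption, not a value from the paper.
freqs = yarn_inv_freq(dim=128)
```

Dimensions that complete more than `beta` rotations inside the original context keep their frequency. Those that complete fewer than `alpha` are divided by `scale`. A sensitivity analysis of the kind the reviewer asks for would vary these two thresholds.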

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- MrRoPE builds a general framework from the radix conversion perspective and unifies mainstream RoPE extension methods such as NTK and YaRN as different base conversion strategies. This addresses the lack of a unified theoretical basis for existing extension schemes and facilitates systematic analysis and optimization.
- The proposed MrRoPE-Pro requires no additional fine-tuning and performs strongly on long-context tasks: over 85% recall in the Needle-in-a-Haystack test

Weaknesses

- The proposed RoPE extension methods (MrRoPE-Uni/Pro) focus only on the training-free context-window extension setting, and no fine-tuning experiments were conducted. This rules out a direct comparison with extension methods such as xPOS and LongRoPE, which need to be evaluated in fine-tuning scenarios, and makes it difficult to fully demonstrate the method's superiority across application scenarios or its potential for integration with other methods.
- It is well

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The unifying theory is good, and the results are compelling across different tasks. It is good to see that a training-free method achieves consistent gains.

Weaknesses

It is not clear whether the radix idea can work for other positional encoding methods, which limits the impact to RoPE. It is also not clear whether the progressive scaling in the proposed method introduces any latency compared to YaRN. A sensitivity analysis of the parameters inherited from YaRN in Appendix B would also be helpful.

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques