LongRoPE2: Near-Lossless LLM Context Window Scaling

Ning Shang; Li Lyna Zhang; Siyuan Wang; Gaokai Zhang; Gilsinia Lopez; Fan Yang; Weizhu Chen; Mao Yang

arXiv:2502.20082·cs.CL·February 28, 2025

TL;DR

LongRoPE2 extends the context window of large language models to 128K tokens while retaining over 98.5% of short-context performance, using a searched RoPE rescaling and mixed context window training.

Contribution

It presents a new RoPE rescaling algorithm and a mixed context training approach that significantly extend context length with minimal additional training data.

Findings

01

Achieves 128K effective context length on LLaMA3-8B.

02

Retains over 98.5% of short-context performance.

03

Requires only 10B training tokens, far fewer than previous methods.

Abstract

LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the…
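To make contributions (2) and (3) more concrete, here is a minimal sketch of per-dimension RoPE rescaling and of choosing between original and rescaled RoPE by sequence length. It is an illustration under stated assumptions, not the authors' implementation: the function names, the RoPE base of 10000, the 8192-token original window, and the uniform factor of 16 are all hypothetical placeholders. In particular, the paper obtains its rescale factors by evolutionary search guided by "needle-driven" perplexity, which this sketch does not attempt to reproduce.

```python
import math

def rope_inv_freq(head_dim, base=10000.0, factors=None):
    """Per-pair RoPE frequencies theta_i = base^(-2i/d), each divided by factors[i].

    `factors` holds one rescale factor per rotary pair; None means original RoPE.
    """
    half = head_dim // 2
    if factors is None:
        factors = [1.0] * half
    if len(factors) != half:
        raise ValueError("need one rescale factor per rotary pair")
    return [base ** (-2.0 * i / head_dim) / factors[i] for i in range(half)]

def pick_factors(seq_len, original_window, long_factors):
    """Hypothetical selection rule: original RoPE for sequences that fit the
    original window, rescaled RoPE for longer ones (an assumption, not the
    paper's exact procedure)."""
    if seq_len <= original_window:
        return [1.0] * len(long_factors)
    return list(long_factors)

def rotate(vec, position, inv_freq):
    """Apply the RoPE rotation to a vector laid out as (x0, y0, x1, y1, ...)."""
    out = []
    for i, theta in enumerate(inv_freq):
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

# Illustrative values only: a uniform factor of 16 stands in for searched factors.
head_dim = 8
long_factors = [16.0] * (head_dim // 2)
factors = pick_factors(seq_len=100_000, original_window=8192, long_factors=long_factors)
inv_freq = rope_inv_freq(head_dim, factors=factors)
rotated = rotate([1.0, 0.0] * (head_dim // 2), position=5, inv_freq=inv_freq)
```

Dividing each frequency by its factor slows the rotation in that dimension, so positions beyond the original window map to angles the model is more likely to have seen during pre-training. A searched solution would generally assign different factors to different dimensions rather than one uniform value.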

Code & Models

Repositories

microsoft/longrope
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: ADaptive gradient method with the OPTimal convergence rate