LongRoPE2: Near-Lossless LLM Context Window Scaling
Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang

TL;DR
LongRoPE2 extends the effective context window of large language models to 128K tokens while preserving performance on the original shorter context window, by combining a RoPE rescaling algorithm with mixed context window training.
Contribution
The paper presents a new RoPE rescaling algorithm and a mixed context window training approach that together extend the context length with a small amount of additional training data.
Findings
Achieves 128K effective context length on LLaMA3-8B.
Retains over 98.5% of short-context performance.
Requires only 10B training tokens, far fewer than previous methods.
Abstract
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the…
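The rescaling and mixed-window ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the default base of 10000, and the linearly spaced placeholder factors are assumptions made for illustration. In the paper the per-dimension factors come from an evolutionary search guided by "needle-driven" perplexity, which this sketch does not attempt to reproduce.

```python
def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of dimensions.

    Higher indices rotate more slowly (longer periods); these are the
    "higher RoPE dimensions" the abstract refers to.
    """
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def placeholder_factors(num_pairs, max_scale):
    """Hypothetical per-dimension factors, linearly spaced from 1 to max_scale.

    Placeholder only: stands in for factors that a search would produce.
    """
    if num_pairs == 1:
        return [max_scale]
    step = (max_scale - 1.0) / (num_pairs - 1)
    return [1.0 + step * i for i in range(num_pairs)]


def rescale_inv_freq(inv_freq, factors):
    """Divide each frequency by its own factor; a factor > 1 stretches the period."""
    if len(inv_freq) != len(factors):
        raise ValueError("need exactly one factor per frequency")
    return [f / s for f, s in zip(inv_freq, factors)]


def select_inv_freq(seq_len, original_window, original, rescaled):
    """Mixed-window idea: original RoPE for sequences inside the pre-trained
    window, rescaled RoPE for longer sequences."""
    return original if seq_len <= original_window else rescaled


def rotation_angles(position, inv_freq):
    """Rotation angle applied to each dimension pair at a given token position."""
    return [position * f for f in inv_freq]


if __name__ == "__main__":
    original = rope_inv_freq(head_dim=8)
    rescaled = rescale_inv_freq(original, placeholder_factors(4, max_scale=16.0))
    for seq_len in (4096, 131072):
        freqs = select_inv_freq(seq_len, 8192, original, rescaled)
        print(seq_len, rotation_angles(seq_len - 1, freqs))
```

In this sketch a short sequence keeps the original frequencies unchanged, while a long sequence uses the stretched ones, so the angles seen in the slow-rotating dimensions stay closer to the range covered during pre-training. The window sizes in the usage lines (8192 and 131072) are illustrative values, not settings taken from the paper.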
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: ADaptive gradient method with the OPTimal convergence rate
