Extending LLMs' Context Window with 100 Samples
Yikai Zhang, Junlong Li, Pengfei Liu

TL;DR
This paper introduces a method that extends LLMs' context window efficiently by adjusting RoPE's base frequency and scaling the attention logits. Validated on LLaMA-2-7B-Chat, it needs little data and training and improves fine-tuning performance and robustness across context window sizes.
Contribution
A new extension to RoPE that combines frequency adjustment and attention scaling, enabling large context windows with minimal samples and training steps.
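To make the "frequency adjustment" part concrete, here is a minimal sketch of standard RoPE with a configurable base. It is not the authors' code: the function names and the enlarged base value of 500,000 are illustrative assumptions, and only the default base of 10,000 comes from the usual RoPE formulation.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/d) from standard RoPE."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (x_2i, x_2i+1) by the angle position * theta_i."""
    out = []
    for i, theta in enumerate(rope_inv_freq(len(vec), base)):
        angle = position * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical values for illustration only.
default_freqs = rope_inv_freq(8, base=10000.0)
enlarged_freqs = rope_inv_freq(8, base=500000.0)  # assumed larger base

q = [0.3, -1.2, 0.7, 0.1, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.5, 0.8, 0.3, -0.7, 0.4, 0.6]
# The attention score depends only on the relative offset (here 2).
near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
far = dot(rope_rotate(q, 105), rope_rotate(k, 103))
```

Raising the base lowers every frequency except the first, so each dimension pair rotates more slowly per token and distant positions stay within rotation angles closer to those seen in pre-training.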
Findings
Extends LLaMA-2-7B-Chat context to 16,384 tokens with only 100 samples.
Improves fine-tuning performance and robustness across different context sizes.
Examines how data composition and training curricula affect context window extension.
Abstract
Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior works like Position Interpolation (PI) and YaRN are resource-intensive and lack comparative experiments to assess their applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e. the information entropy of attention scores) to maintain stability and introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window. We validate the superiority of…
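The attention-entropy idea in the abstract can be illustrated with a small sketch. It computes the Shannon entropy of one softmax attention row and shows how multiplying the logits by a scalar changes it. The helper names and the scale of 1.5 are assumptions chosen for illustration; this summary does not give the paper's actual scaling rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(logits, scale=1.0):
    """Shannon entropy (in nats) of the attention distribution for one query."""
    probs = softmax([scale * x for x in logits])
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy logits for one query; repeating them mimics a 4x longer context.
short_ctx = [2.0, 1.0, 0.5, 0.0]
long_ctx = short_ctx * 4

h_short = attention_entropy(short_ctx)
h_long = attention_entropy(long_ctx)               # grows by exactly log(4)
h_scaled = attention_entropy(long_ctx, scale=1.5)  # hypothetical scale factor
```

In this toy case, attending over more tokens spreads the distribution out and raises its entropy, while a scale above 1 sharpens the softmax and pulls the entropy back down. That is the kind of stabilising effect the abstract attributes to scaling the attention logits.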
Taxonomy
Topics: Speech and dialogue systems · Indoor and Outdoor Localization Technologies · Multimodal Machine Learning Applications
Methods: Pathways Language Model · Balanced Selection · Shrink and Fine-Tune · GPT-NeoX
