Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation
Zichong Li, Chen Liang, Liliang Ren, Tuo Zhao, Yelong Shen, Weizhu Chen

TL;DR
This paper introduces RoPE-Perturbed Self-Distillation, a regularization method that enhances the positional robustness of long-context language models by encouraging consistent predictions across context perturbations.
Contribution
The paper proposes a training regularizer for long-context adaptation: the model is self-distilled across RoPE-perturbed views of the same training sequence, which makes its predictions less sensitive to where relevant evidence sits in the context.
Findings
Achieves up to 12.04% improvement on RULER-64K for Llama-3-8B.
Gains 2.71% on RULER-256K for Qwen-3-4B after supervised fine-tuning.
Demonstrates improved length extrapolation beyond the training context window.
Abstract
Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation.…
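To make the mechanism in the abstract concrete, here is a minimal pure-Python sketch of the two ingredients it names: building a perturbed "view" by changing position indices, and a consistency loss between the predictions of two views. The abstract does not specify the perturbation scheme or the loss, so everything here is an assumption for illustration: the function names (`perturb_positions`, `consistency_loss`), the choice of shifting a suffix by a random offset, and the use of KL divergence with the unperturbed view as the teacher are all hypothetical, not the authors' implementation.

```python
import math
import random


def perturb_positions(seq_len, max_pos, rng):
    """Return position ids for a perturbed view of a sequence (hypothetical scheme).

    Assumption: pick a random split point and shift every position after it
    by a random offset, so the suffix appears to sit further away. Order is
    preserved and all ids stay below max_pos.
    Requires seq_len >= 2 and max_pos >= seq_len.
    """
    split = rng.randrange(1, seq_len)          # first index of the shifted suffix
    shift = rng.randint(0, max_pos - seq_len)  # keeps the last id below max_pos
    return [i if i < split else i + shift for i in range(seq_len)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) between two views.

    Assumption: the teacher view is treated as a fixed target; in a real
    training framework its logits would be computed without gradients.
    """
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax(t), softmax(s))
    return total / len(teacher_logits)


if __name__ == "__main__":
    rng = random.Random(0)
    print(perturb_positions(8, 32, rng))
    same = [[1.0, 2.0, 3.0], [0.5, 0.1, -1.0]]
    print(consistency_loss(same, same))
```

In an actual training loop, the perturbed ids would be passed to the model as its RoPE position indices, and this consistency term would presumably be added to the usual next-token loss with some weight; both of those details are assumptions, not taken from the abstract.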
