Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

Zichong Li; Chen Liang; Liliang Ren; Tuo Zhao; Yelong Shen; Weizhu Chen

arXiv:2604.14339·cs.CL·April 17, 2026


TL;DR

This paper introduces RoPE-Perturbed Self-Distillation, a training regularizer that improves the positional robustness of long-context language models by encouraging consistent predictions across views of the same sequence whose RoPE indices have been perturbed.

Contribution

The paper proposes a training regularizer for long-context adaptation: the model is self-distilled across views of the same training sequence with perturbed RoPE indices, which makes its predictions less sensitive to where relevant evidence sits in the context.

Findings

01  Achieves up to 12.04% improvement on RULER-64K for Llama-3-8B.

02  Gains 2.71% on RULER-256K for Qwen-3-4B after supervised fine-tuning.

03  Demonstrates improved length extrapolation beyond the training context window.

Abstract

Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation.…
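
The core idea in the abstract (perturb the RoPE indices to form a second view, then penalize disagreement between views) can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the chunk-level shuffle, the choice to leave the last `n_tail` positions untouched, the KL direction, and the names `perturbed_position_ids` and `consistency_loss` are all hypothetical. In a real setup the two sets of logits would come from running the same model on the same tokens with the original and the perturbed position ids.

```python
import math
import random


def perturbed_position_ids(seq_len, chunk_size, n_tail, rng):
    """Build RoPE position ids for a perturbed view (hypothetical scheme).

    Tokens stay where they are in the sequence; only the position index that
    would be fed to RoPE changes. Context chunks swap position ranges, while
    the last `n_tail` tokens keep their original indices (an assumption).
    """
    ctx_len = seq_len - n_tail
    # Split the context into contiguous chunks of token indices.
    chunks = [list(range(start, min(start + chunk_size, ctx_len)))
              for start in range(0, ctx_len, chunk_size)]
    order = list(range(len(chunks)))
    rng.shuffle(order)

    position_ids = [0] * seq_len
    next_pos = 0
    # Lay the chunks out in shuffled order; within-chunk order is preserved.
    for chunk_idx in order:
        for tok in chunks[chunk_idx]:
            position_ids[tok] = next_pos
            next_pos += 1
    # Tail tokens keep their original positions.
    for tok in range(ctx_len, seq_len):
        position_ids[tok] = tok
    return position_ids


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student).

    Which view plays the teacher role is an assumption of this sketch.
    """
    losses = [kl_divergence(softmax(t), softmax(s))
              for t, s in zip(teacher_logits, student_logits)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    ids = perturbed_position_ids(seq_len=10, chunk_size=3, n_tail=2, rng=rng)
    print("perturbed position ids:", ids)

    # Toy logits standing in for the model outputs on the two views.
    original_view = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
    perturbed_view = [[1.2, 0.9, 0.4], [0.1, 1.7, 0.2]]
    print("consistency loss:", consistency_loss(original_view, perturbed_view))
```

In a training loop, a term like this would typically be added to the usual next-token loss with some weighting coefficient. The sketch leaves out details such as attention masking and gradient handling for the teacher view.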
