Short Data, Long Context: Distilling Positional Knowledge in Transformers
Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov, Md Rifat Arefin, Adithya Sagar

TL;DR
This paper presents evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, training only on packed short-context samples, and analyzes the transfer through the lens of Rotary Position Embedding (RoPE).
Contribution
It shows that long-context knowledge can be transferred through distillation without long-context pre-training, and it analyzes how positional information is transferred and which update patterns emerge during long-context extension.
Findings
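Phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, achieves the best long-context performance in knowledge distillation setups, consistent with prior work.
Positional information influences the teacher's output distribution, and logit-based knowledge distillation can transfer it directly to the student.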
Structured update patterns emerge in the query state during long-context extension.
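For readers unfamiliar with RoPE, the following is a minimal sketch of the standard rotary formulation, not the paper's code. It illustrates why the RoPE base matters for context length: a larger base slows the low-frequency pairs, so they complete fewer rotations across a given window. The function names and the base values are assumptions chosen for illustration; the paper's actual phase-wise scaling schedule is not given in this summary.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rotations_over_context(head_dim, base, context_len):
    """Full rotations each frequency pair completes across context_len positions."""
    return [context_len * f / (2.0 * math.pi) for f in rope_inv_freq(head_dim, base)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * frequency."""
    out = []
    for i, f in enumerate(rope_inv_freq(len(vec), base)):
        angle = position * f
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

if __name__ == "__main__":
    # Hypothetical base values: compare how much of a rotation the slowest
    # pair completes over a 4096-token window under each base.
    for base in (10000.0, 500000.0):
        slowest = rotations_over_context(64, base, 4096)[-1]
        print(f"base={base:g}: slowest pair completes {slowest:.4f} rotations")
```

Because each pair is rotated by an angle proportional to its position, the dot product of a rotated query and key depends only on their relative offset, which is the property that lets attention scores carry positional information.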
Abstract
Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer.…
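As an illustration of what logit-based knowledge distillation means here, the sketch below computes a per-token KL divergence between teacher and student output distributions in plain Python. It is a generic formulation under stated assumptions, not the authors' implementation: the temperature parameter, the KL direction (teacher to student), and the function names are all assumptions, and the summary does not specify the exact loss the paper uses.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean over token positions of KL(teacher || student).

    Both arguments are lists of per-position logit rows with the same shape.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row, temperature)
        s_logp = log_softmax(s_row, temperature)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)

if __name__ == "__main__":
    # Toy logits for two token positions over a three-word vocabulary.
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
    print("KD loss:", kd_loss(teacher, student))
```

In a setup like the one the abstract describes, the teacher would score every position of a long window filled with packed short samples, so its logits reflect each token's position in that window, and the student is trained to match them.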
