Short Data, Long Context: Distilling Positional Knowledge in Transformers

Patrick Huber; Ernie Chang; Chinnadhurai Sankar; Rylan Conway; Igor Fedorov; Md Rifat Arefin; Adithya Sagar

arXiv:2604.06070·cs.CL·April 8, 2026

TL;DR

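This paper shows that long-context retrieval capabilities can be transferred to student models via logit-based knowledge distillation, even when training only on short-context samples packed into a long-context window. It analyzes the transfer through the lens of Rotary Position Embedding (RoPE).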

Contribution

The paper presents evidence that long-context knowledge can be transferred through logit-based distillation without expensive long-context pre-training, and it analyzes how positional information is transferred and which update patterns emerge during long-context extension.
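As a rough illustration of what "logit-based knowledge distillation" means here, the sketch below computes a per-token forward KL divergence between teacher and student output distributions. The exact loss used in the paper (direction of the KL term, temperature, any weighting) is not stated in this summary, so the function names `log_softmax` and `kd_loss`, the `temperature` parameter, and the choice of forward KL are all assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL(teacher || student).

    Both arguments are lists of logit vectors, one vector per token position.
    Forward KL is an assumption; the paper's exact objective is not given here.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row, temperature)
        s_logp = log_softmax(s_row, temperature)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)


# Toy example: two token positions, vocabulary of three entries.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]]
student = [[1.5, 0.7, -0.5], [0.2, 0.8, 2.0]]
print(kd_loss(teacher, student))   # positive: the distributions differ
print(kd_loss(teacher, teacher))   # zero: identical distributions
```

Because the loss depends only on the teacher's output distribution at each position, any positional information that shapes those distributions is available to the student as a training signal, even when the packed samples themselves are short.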

Findings

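01. Phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, achieves the best long-context performance in knowledge distillation setups.

02. Positional information influences the teacher's output distribution, so logit-based distillation can transfer it to the student.

03. Structured update patterns emerge in the query state during long-context extension.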

Abstract

Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer.…
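For readers unfamiliar with RoPE, the following minimal sketch shows the two properties the abstract relies on: attention scores between rotated queries and keys depend only on relative position, and the rotation base controls how much of the rotational spectrum a given context length covers. The function names, the toy `head_dim`, and the two base values are illustrative assumptions. Raising the base is one common way to scale RoPE, but this summary does not specify the paper's phase-wise schedule, so none is reproduced here.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """One rotation frequency per pair of dimensions (standard RoPE form)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position * frequency."""
    out = []
    for i, freq in enumerate(rope_frequencies(len(vec), base)):
        angle = position * freq
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def longest_wavelength(head_dim, base):
    """Number of positions needed for the slowest pair to complete one turn."""
    return 2.0 * math.pi / min(rope_frequencies(head_dim, base))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (3) at different absolute positions gives the same score.
near = dot(apply_rope(q, 10), apply_rope(k, 7))
far = dot(apply_rope(q, 103), apply_rope(k, 100))
print(abs(near - far) < 1e-9)

# A larger base (illustrative value) stretches the slowest rotation,
# so longer contexts stay within a single turn of that dimension pair.
print(longest_wavelength(8, 10000.0) < longest_wavelength(8, 500000.0))
```

The second check is the intuition behind scaling the base as the context window grows: with a small base, the slowest dimension pairs already wrap around within the window, while with a large base and a short window they barely rotate at all.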
