TL;DR
LinearARD is a self-distillation method that restores the original capabilities of RoPE-scaled large language models using a linear-memory kernel, reaching strong long-context performance with minimal training data.
Contribution
The paper introduces LinearARD, a memory-efficient self-distillation technique that aligns the attention-structure distributions of a RoPE-scaled student with those of a frozen native-RoPE teacher, outperforming existing methods on long-context benchmarks.
Findings
Recovers 98.3% of short-text performance on LLaMA2-7B with extended context.
Uses only 4.25 million training tokens, significantly fewer than prior methods.
Surpasses state-of-the-art methods on long-context benchmarks with minimal training data.
Abstract
The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense Q/Q, K/K, and V/V self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit…
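The idea of aligning row-wise relation distributions without storing a full relation map can be illustrated with a small NumPy sketch. It is not the paper's kernel: the function names, the use of KL divergence as the alignment loss, the 1/sqrt(d) scaling, the absence of a causal mask, and the chunk size are all assumptions made for illustration. The sketch first computes a per-row log-sum-exp over column chunks, then makes a second chunked pass that accumulates the divergence, so only an n-by-chunk block of logits exists at any time.

```python
import numpy as np

def row_logsumexp(x, scale, chunk):
    """Per-row log-sum-exp of scale * x @ x.T, streamed over column chunks."""
    n = x.shape[0]
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for start in range(0, n, chunk):
        logits = (scale * x) @ x[start:start + chunk].T  # shape (n, <=chunk)
        new_max = np.maximum(running_max, logits.max(axis=1))
        # Rescale the old sum to the new max before adding this chunk.
        running_sum = running_sum * np.exp(running_max - new_max)
        running_sum += np.exp(logits - new_max[:, None]).sum(axis=1)
        running_max = new_max
    return running_max + np.log(running_sum)

def chunked_relation_kl(teacher, student, chunk=64):
    """Mean row-wise KL(teacher || student) over self-relation distributions.

    Assumed (hypothetical) loss: each row of softmax(x @ x.T / sqrt(d)) is a
    distribution, and teacher and student rows are compared with KL.
    """
    n, d = teacher.shape
    scale = 1.0 / np.sqrt(d)
    lse_t = row_logsumexp(teacher, scale, chunk)
    lse_s = row_logsumexp(student, scale, chunk)
    kl = np.zeros(n)
    for start in range(0, n, chunk):
        log_p = (scale * teacher) @ teacher[start:start + chunk].T - lse_t[:, None]
        log_q = (scale * student) @ student[start:start + chunk].T - lse_s[:, None]
        kl += (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
    return kl.mean()

def dense_relation_kl(teacher, student):
    """Reference version that materializes the full n x n relation maps."""
    d = teacher.shape[1]

    def log_softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    log_p = log_softmax(teacher @ teacher.T / np.sqrt(d))
    log_q = log_softmax(student @ student.T / np.sqrt(d))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=1).mean()
```

With a fixed chunk size the peak memory of `chunked_relation_kl` grows linearly in sequence length, while `dense_relation_kl` grows quadratically; both should agree numerically on the same inputs. A real training kernel would also need gradients and, for a decoder, causal masking, neither of which is shown here.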
