LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

Ning Yang; Hengyu Zhong; Wentao Wang; Baoliang Tian; Haijun Zhang; Jun Wang

arXiv:2604.00004·cs.CL·April 2, 2026

TL;DR

LinearARD is a self-distillation method that restores the short-text capabilities of RoPE-scaled large language models using a linear-memory kernel, while keeping strong long-context performance and requiring minimal training data.

Contribution

The paper introduces LinearARD, a memory-efficient self-distillation technique that aligns the attention-structure distributions of a RoPE-scaled student with those of a frozen native-RoPE teacher, outperforming existing methods on long-context benchmarks.

Findings

1. Recovers 98.3% of short-text performance on LLaMA2-7B with extended context.

2. Uses only 4.25 million training tokens, significantly fewer than prior methods.

3. Outperforms state-of-the-art baselines on long-context benchmarks with minimal training data.

Abstract

The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense $Q/Q$, $K/K$, and $V/V$ self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of $n \times n$ relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit…
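
The idea of matching row-wise relation distributions without storing an $n \times n$ map can be illustrated with a small NumPy sketch. This is not the authors' kernel: the abstract is truncated before the details, so the choice of KL divergence, the $1/\sqrt{d}$ scaling, and the function names `streaming_relation_kl` and `dense_relation_kl` are all assumptions made for illustration. The sketch streams over column blocks, keeps only per-token running log-sum-exp statistics, and checks the result against a dense reference.

```python
import numpy as np


def dense_relation_kl(teacher, student):
    """Reference: materializes both n x n relation maps (quadratic memory)."""
    d = teacher.shape[1]

    def log_softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    log_p = log_softmax(teacher @ teacher.T / np.sqrt(d))
    log_q = log_softmax(student @ student.T / np.sqrt(d))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=1).mean()


def streaming_relation_kl(teacher, student, block=64):
    """Mean row-wise KL(teacher || student) over self-relation maps X X^T.

    Assumption: `teacher` and `student` are (n, d) matrices standing in for
    one of Q, K, or V. Only an (n, block) slice of logits exists at a time.
    """
    n, d = teacher.shape
    scale = 1.0 / np.sqrt(d)
    m_t = np.full(n, -np.inf)  # running row max, teacher logits
    m_s = np.full(n, -np.inf)  # running row max, student logits
    z_t = np.zeros(n)          # running sum of exp(t - m_t)
    z_s = np.zeros(n)          # running sum of exp(s - m_s)
    w = np.zeros(n)            # running sum of exp(t - m_t) * (t - s)

    for start in range(0, n, block):
        t = (teacher @ teacher[start:start + block].T) * scale
        s = (student @ student[start:start + block].T) * scale

        # Online log-sum-exp update for the teacher, rescaling old sums.
        new_m_t = np.maximum(m_t, t.max(axis=1))
        rescale = np.exp(m_t - new_m_t)
        e_t = np.exp(t - new_m_t[:, None])
        z_t = z_t * rescale + e_t.sum(axis=1)
        w = w * rescale + (e_t * (t - s)).sum(axis=1)
        m_t = new_m_t

        # Same update for the student normalizer.
        new_m_s = np.maximum(m_s, s.max(axis=1))
        z_s = z_s * np.exp(m_s - new_m_s) + np.exp(s - new_m_s[:, None]).sum(axis=1)
        m_s = new_m_s

    lse_t = m_t + np.log(z_t)
    lse_s = m_s + np.log(z_s)
    # KL_i = sum_j p_ij (t_ij - s_ij) - lse_t_i + lse_s_i
    return (w / z_t - lse_t + lse_s).mean()
```

With a fixed `block`, peak memory grows linearly in the sequence length $n$ rather than quadratically, which is the property the abstract attributes to its linear-memory kernel; a fused GPU implementation would differ substantially from this sketch.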

Code & Models

Repositories

gracefulning/LinearARD (GitHub)
