LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation

Zican Dong; Junyi Li; Jinhao Jiang; Mingyu Xu; Wayne Xin Zhao; Bingning Wang; Weipeng Chen

arXiv:2502.07365 · cs.CL · May 29, 2025

TL;DR

LongReD is a novel training method that reduces performance loss on short-text tasks in large language models with extended context windows by using restoration distillation techniques.

Contribution

The paper introduces LongReD, a new pre-training approach that mitigates short-text degradation in long-context LLMs through distribution alignment and distillation.

Findings

01

LongReD preserves short-text performance effectively.

02

It maintains or improves long-text handling capacity.

03

In experiments, LongReD outperforms baseline models.

Abstract

Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also…
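
The short-text distillation step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the function names `cosine_distance`, `restoration_loss`, and `combined_loss`, the choice of cosine distance as the discrepancy measure, the weighting factor `alpha`, and the dictionary layout of the hidden states are all hypothetical. The sketch only shows the general idea of pulling the extended model's hidden states at selected layers toward those of the original model on short texts, while a separate language-modeling loss covers long texts.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return 1.0 - dot / (norm_u * norm_v + 1e-12)

def restoration_loss(student_states, teacher_states, selected_layers):
    """Mean per-token distance between extended-model (student) and
    original-model (teacher) hidden states on the selected layers.

    Hypothetical layout: states[layer] is a list of per-token vectors.
    """
    total, count = 0.0, 0
    for layer in selected_layers:
        for s_vec, t_vec in zip(student_states[layer], teacher_states[layer]):
            total += cosine_distance(s_vec, t_vec)
            count += 1
    return total / count if count else 0.0

def combined_loss(long_text_lm_loss, short_text_distill_loss, alpha=1.0):
    """Sum the long-text training loss and the weighted short-text
    distillation loss (alpha is an assumed hyperparameter)."""
    return long_text_lm_loss + alpha * short_text_distill_loss

# Toy example: two selected layers, one token each, hidden size 2.
teacher = {0: [[1.0, 0.0]], 2: [[0.0, 1.0]]}
student = {0: [[1.0, 0.0]], 2: [[1.0, 0.0]]}  # layer 2 has drifted
drift = restoration_loss(student, teacher, selected_layers=[0, 2])
total = combined_loss(long_text_lm_loss=2.0, short_text_distill_loss=drift)
```

In this toy example the student matches the teacher on layer 0 and is orthogonal to it on layer 2, so the distillation term penalizes only the drifted layer. In a real training setup the teacher states would come from the frozen original model and gradients would flow only through the extended model.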

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need