LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen

TL;DR
LongReD is a novel training method that reduces performance loss on short-text tasks in large language models with extended context windows by using restoration distillation techniques.
Contribution
The paper introduces LongReD, a new pre-training approach that mitigates short-text degradation in long-context LLMs through distribution alignment and distillation.
Findings
LongReD preserves short-text performance effectively.
It maintains or improves long-text handling capacity.
Experimental results outperform baseline models.
Abstract
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
MethodsSoftmax · Attention Is All You Need
