Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation
Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou

TL;DR
Warmup-Distill is a method that pre-aligns the student model's knowledge with the teacher's to reduce distribution mismatch, leading to improved performance in knowledge distillation for large language models.
Contribution
It introduces a simple pre-distillation alignment technique that enhances the effectiveness of knowledge distillation by addressing distribution mismatch early on.
Findings
Outperforms the vanilla student on 7 benchmarks, with a score increase of at least +0.4.
Improves math task accuracy by up to +1.9%.
Effective in reducing distribution mismatch in early distillation stages.
Abstract
The widespread deployment of Large Language Models (LLMs) is hindered by their high computational demands, making knowledge distillation (KD) crucial for developing compact, smaller models. However, conventional KD methods suffer from a distribution mismatch between the teacher and student models, which leads to poor distillation performance. For instance, the widely used KL-based methods suffer from mode-averaging and mode-collapsing problems because the probability distributions of the two models are mismatched. Previous studies mainly address this issue by using different distance calculations between the distributions of the two models. Unfortunately, the distribution mismatch still exists in the early stage of distillation. Hence, to reduce the impact of distribution mismatch, we propose a simple yet efficient method, named Warmup-Distill, which aligns the distillation of the…
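The mode-averaging and mode-collapsing behaviour that the abstract attributes to KL-based objectives can be illustrated with a toy calculation. The sketch below is not taken from the paper: the three-token distributions and the names `teacher`, `student_spread`, and `student_peaked` are hypothetical, chosen only to show how forward KL (teacher to student) and reverse KL (student to teacher) rank two mismatched students differently.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher distribution over three tokens with two modes (assumption).
teacher = [0.49, 0.02, 0.49]

# Two hypothetical students (assumptions): one spreads mass over every token,
# the other concentrates almost all mass on a single teacher mode.
student_spread = [1 / 3, 1 / 3, 1 / 3]
student_peaked = [0.96, 0.02, 0.02]

results = {}
for name, student in [("spread", student_spread), ("peaked", student_peaked)]:
    forward = kl_divergence(teacher, student)  # KL(teacher || student)
    reverse = kl_divergence(student, teacher)  # KL(student || teacher)
    results[name] = (forward, reverse)
    print(f"{name}: forward KL = {forward:.3f}, reverse KL = {reverse:.3f}")
```

In this toy setting, forward KL assigns the lower loss to the spread student (averaging over modes), while reverse KL assigns the lower loss to the peaked student (collapsing onto one mode). It illustrates the general tendency only and says nothing about how Warmup-Distill itself is implemented.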
Taxonomy
Topics: Education and Critical Thinking Development
Methods: Knowledge Distillation
