Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation

Zengkui Sun; Yijin Liu; Fandong Meng; Yufeng Chen; Jinan Xu; Jie Zhou

arXiv:2502.11766·cs.CL·February 18, 2025

TL;DR

Warmup-Distill is a method that pre-aligns the student model's knowledge with the teacher's to reduce distribution mismatch, leading to improved performance in knowledge distillation for large language models.

Contribution

It introduces a simple pre-distillation alignment technique that enhances the effectiveness of knowledge distillation by addressing distribution mismatch early on.

Findings

1. Outperforms the vanilla student on 7 benchmarks, with a score increase of at least +0.4.

2. Improves math task accuracy by up to +1.9%.

3. Effective in reducing distribution mismatch in the early stages of distillation.

Abstract

The widespread deployment of Large Language Models (LLMs) is hindered by their high computational demands, making knowledge distillation (KD) crucial for developing compact smaller models. However, conventional KD methods suffer from a distribution mismatch between the teacher and student models, which leads to poor distillation performance. For instance, the widely used KL-based methods suffer from mode-averaging and mode-collapsing problems because of the mismatched probability distributions of the two models. Previous studies mainly address this issue by using different distance calculations between the distributions of the two models. Unfortunately, the distribution mismatch still exists in the early stage of distillation. Hence, to reduce the impact of distribution mismatch, we propose a simple yet efficient method, named Warmup-Distill, which aligns the distillation of the…
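The mode-averaging and mode-collapsing behaviour of KL-based objectives mentioned in the abstract can be illustrated with a toy example. The sketch below is not the Warmup-Distill method; it only assumes the common reading that forward KL, KL(teacher || student), tends toward mode-averaging and reverse KL, KL(student || teacher), tends toward mode-collapsing. The distributions and the helper `kl_divergence` are hypothetical and chosen purely for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical bimodal teacher distribution over a 5-token vocabulary.
teacher = [0.48, 0.01, 0.02, 0.01, 0.48]

# Two hypothetical students that cannot match both modes exactly.
student_spread = [0.2, 0.2, 0.2, 0.2, 0.2]        # spreads mass everywhere
student_peaked = [0.96, 0.01, 0.01, 0.01, 0.01]   # commits to a single mode

# Forward KL: the teacher is the reference distribution.
forward_spread = kl_divergence(teacher, student_spread)
forward_peaked = kl_divergence(teacher, student_peaked)

# Reverse KL: the student is the reference distribution.
reverse_spread = kl_divergence(student_spread, teacher)
reverse_peaked = kl_divergence(student_peaked, teacher)

print(f"forward KL  spread={forward_spread:.3f}  peaked={forward_peaked:.3f}")
print(f"reverse KL  spread={reverse_spread:.3f}  peaked={reverse_peaked:.3f}")
```

In this toy setting the forward KL is lower for the spread-out student, while the reverse KL is lower for the student that collapses onto one mode, which is the asymmetry the abstract attributes to mismatched teacher and student distributions.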

Code & Models

Repositories

acerkoo/warmupdistill (Official)

Taxonomy

Topics: Education and Critical Thinking Development

Methods: Knowledge Distillation