Dynamic Knowledge Distillation for Pre-trained Language Models
Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun

TL;DR
This paper introduces a dynamic knowledge distillation approach for pre-trained language models, allowing the student to adapt its learning process based on performance, data informativeness, and objective contributions, leading to improved efficiency and performance.
Contribution
It proposes a novel dynamic KD framework that adjusts teacher selection, data usage, and objectives during training, enhancing model compression and training efficiency.
Findings
Proper teacher selection boosts student performance.
Conducting KD on only the 10% of instances selected as informative achieves comparable performance while greatly accelerating training.
Adjusting alignment objectives improves student outcomes.
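As a rough illustration of the data-selection finding, the sketch below keeps only a fraction of a batch for distillation. The summary here does not say how "informative" is measured, so the scoring rule is an assumption: instances are ranked by the entropy of the student's predicted distribution, on the idea that uncertain predictions are the ones the student can learn most from. The function names (`select_informative`, `entropy`) and the `keep_ratio` parameter are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_informative(student_logits, keep_ratio=0.1):
    """Return sorted indices of the highest-entropy instances.

    ASSUMPTION: informativeness is approximated by the student's
    prediction entropy; this is one possible criterion, chosen here
    only for illustration.
    """
    scores = [entropy(softmax(logits)) for logits in student_logits]
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

With a batch of ten instances and `keep_ratio=0.1`, only the single instance the student is least sure about would be passed on to the distillation step.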
Abstract
Knowledge distillation (KD) has proven effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected teacher model on a pre-defined training dataset. In this paper, we explore whether dynamic knowledge distillation, which empowers the student to adjust the learning procedure according to its competency, can improve student performance and learning efficiency. We explore dynamic adjustments along three aspects: teacher model adoption, data selection, and KD objective adaptation. Experimental results show that (1) proper selection of the teacher model can boost the performance of the student model; (2) conducting KD with 10% informative instances achieves comparable performance while greatly accelerating training; (3) the student performance can be…
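For readers unfamiliar with the static baseline the abstract describes, the following is a minimal sketch of a student aligning its output distribution to a teacher's. It assumes the commonly used formulation: a KL divergence between temperature-softened distributions, scaled by the squared temperature. The abstract does not specify the exact objective, so the function name `kd_loss`, the default temperature, and the scaling are illustrative assumptions rather than the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of temperature-scaled logits (log-sum-exp trick)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions for one instance.

    ASSUMPTION: the temperature and the T^2 scaling follow the common
    distillation recipe; the source text does not state either.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2
```

The loss is zero when the student's logits match the teacher's and grows as the two distributions diverge. A dynamic scheme of the kind the abstract describes would vary which teacher supplies `teacher_logits`, which instances are scored, or how such a term is weighted against other objectives.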
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
