Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection
Chenglong Wang, Yi Lu, Yongyu Mu, Yimin Hu, Tong Xiao, Jingbo Zhu

TL;DR
This paper introduces an actor-critic method that selects which teacher knowledge to transfer while distilling pre-trained language models, improving efficiency and student performance by choosing relevant knowledge dynamically during training.
Contribution
We propose a novel actor-critic approach for dynamic knowledge selection in distillation, enhancing the effectiveness and efficiency of training smaller language models.
Findings
Our method outperforms strong baselines on GLUE datasets.
Selective knowledge transfer improves distillation efficiency.
Dynamic knowledge selection benefits model performance.
Abstract
Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use of them to train the student model. Our preliminary study shows that: (1) not all of the knowledge is necessary for learning a good student model, and (2) knowledge distillation can benefit from certain knowledge at different training steps. In response to these findings, we propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation. In addition, we offer a refinement of the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method significantly outperforms several strong knowledge distillation baselines.
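To make the idea of selecting knowledge with an actor-critic concrete, here is a minimal toy sketch. It is not the authors' algorithm: it assumes a stateless setting in which the actor picks exactly one knowledge type per training step, the critic is a single running value estimate, and the reward is a simulated stand-in for the student's improvement. The knowledge-type names, the reward means, and every class and function name below are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical knowledge types a teacher might expose (illustrative only).
KNOWLEDGE_TYPES = ["logits", "hidden_states", "attention"]


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


class KnowledgeSelector:
    """Toy stateless actor-critic.

    Actor: softmax policy over knowledge types.
    Critic: running estimate of the expected reward, used as a baseline.
    """

    def __init__(self, n_actions, actor_lr=0.1, critic_lr=0.1, seed=0):
        self.prefs = [0.0] * n_actions
        self.value = 0.0
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.rng = random.Random(seed)

    def policy(self):
        return softmax(self.prefs)

    def select(self):
        """Sample the index of the knowledge type to distill at this step."""
        probs = self.policy()
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, action, reward):
        """One actor-critic update from a single (action, reward) pair."""
        advantage = reward - self.value           # critic's error signal
        self.value += self.critic_lr * advantage  # critic step
        probs = self.policy()
        for a in range(len(self.prefs)):
            # Gradient of log pi(action) with respect to preference a.
            grad = (1.0 if a == action else 0.0) - probs[a]
            self.prefs[a] += self.actor_lr * advantage * grad
        return advantage


def simulated_reward(action, rng):
    """Assumed stand-in for student improvement after one distillation step."""
    means = [0.2, 0.8, 0.4]  # hypothetical usefulness of each knowledge type
    return means[action] + rng.gauss(0.0, 0.1)


def run(steps=2000, seed=0):
    reward_rng = random.Random(seed + 1)
    selector = KnowledgeSelector(len(KNOWLEDGE_TYPES), seed=seed)
    for _ in range(steps):
        action = selector.select()
        selector.update(action, simulated_reward(action, reward_rng))
    return selector


if __name__ == "__main__":
    selector = run()
    for name, prob in zip(KNOWLEDGE_TYPES, selector.policy()):
        print(f"{name}: {prob:.3f}")
```

In a real distillation loop, the simulated reward would be replaced by a measured signal such as the change in the student's validation loss, and the actor would typically condition on the training state and could select a subset of knowledge types rather than a single one.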
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Knowledge Distillation
