Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Chenglong Wang; Yi Lu; Yongyu Mu; Yimin Hu; Tong Xiao; Jingbo Zhu

arXiv:2302.00444 · cs.CL · February 2, 2023

Open Access

TL;DR

This paper introduces an actor-critic-based method that selects which knowledge to transfer when distilling pre-trained language models. Choosing relevant knowledge dynamically during training improves both distillation efficiency and student performance.

Contribution

We propose a novel actor-critic approach for dynamic knowledge selection in distillation, enhancing the effectiveness and efficiency of training smaller language models.

Findings

01. Our method outperforms strong baselines on GLUE datasets.

02. Selective knowledge transfer improves distillation efficiency.

03. Dynamic knowledge selection benefits model performance.

Abstract

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model, and the problem is how to make full use of them to train the student model. Our preliminary study shows that (1) not all of the knowledge is necessary for learning a good student model, and (2) knowledge distillation can benefit from certain knowledge at different training steps. In response to these findings, we propose an actor-critic approach to selecting appropriate knowledge to transfer during knowledge distillation. In addition, we offer a refinement of the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method significantly outperforms several strong knowledge distillation baselines.
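
The idea of choosing which knowledge to transfer at each training step can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not specify the state, reward, or network design, so everything below is an assumption made for illustration. The names `select_knowledge`, `combined_loss`, `actor_critic_update`, and `run_toy_selection` are hypothetical, the "usefulness" scores are made-up numbers standing in for a real training signal, and the critic is reduced to a single running value estimate rather than a learned network.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_knowledge(logits, rng):
    """Actor: sample a binary mask (1 = transfer this knowledge type now)."""
    probs = [sigmoid(l) for l in logits]
    mask = [1 if rng.random() < p else 0 for p in probs]
    return mask, probs


def combined_loss(losses, mask):
    """Sum only the distillation losses whose knowledge type was selected."""
    return sum(l * m for l, m in zip(losses, mask))


def actor_critic_update(logits, mask, probs, reward, value,
                        lr_actor=0.1, lr_critic=0.1):
    """One policy-gradient step with a scalar value baseline (assumed critic)."""
    advantage = reward - value
    # For a Bernoulli policy with p = sigmoid(logit),
    # d/d(logit) log pi(mask) = mask - p.
    new_logits = [l + lr_actor * advantage * (m - p)
                  for l, m, p in zip(logits, mask, probs)]
    # Critic: move the value estimate toward the observed reward.
    new_value = value + lr_critic * advantage
    return new_logits, new_value


def run_toy_selection(steps=2000, seed=0):
    """Toy loop: three hypothetical knowledge types with made-up usefulness."""
    rng = random.Random(seed)
    usefulness = [1.0, -1.0, 0.5]  # assumption: stands in for a real reward
    logits = [0.0, 0.0, 0.0]
    value = 0.0
    for _ in range(steps):
        mask, probs = select_knowledge(logits, rng)
        reward = sum(u * m for u, m in zip(usefulness, mask))
        logits, value = actor_critic_update(logits, mask, probs, reward, value)
    return [sigmoid(l) for l in logits]


if __name__ == "__main__":
    # Selection probabilities after training on the toy reward.
    print(run_toy_selection())
```

In this toy setting the selection probability of a helpful knowledge type drifts upward and that of a harmful one drifts downward. In a real distillation run, the reward would have to come from the student's training signal, and `combined_loss` would be applied to the actual per-knowledge distillation losses.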

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Knowledge Distillation