Gradient Knowledge Distillation for Pre-trained Language Models

Lean Wang; Lei Li; Xu Sun

arXiv:2211.01071 · cs.CL · November 3, 2022


TL;DR

This paper introduces Gradient Knowledge Distillation (GKD), a novel method that incorporates teacher gradients into the distillation process, leading to improved student performance and interpretability in pre-trained language models.

Contribution

GKD is the first to integrate gradient alignment into knowledge distillation for language models, enhancing transfer effectiveness and interpretability.

Findings

01. GKD outperforms previous KD methods in student performance.

02. Incorporating gradients improves student-teacher behavior consistency.

03. Gradient knowledge enhances model interpretability.

Abstract

Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning instance-wise outputs between the teacher and student, while neglecting an important knowledge source, i.e., the gradient of the teacher. The gradient characterizes how the teacher responds to changes in inputs, which we assume is beneficial for the student to better approximate the underlying mapping function of the teacher. Therefore, we propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods regarding student performance. Further analysis shows that incorporating gradient knowledge makes the student behave more…
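To make the idea of a gradient alignment objective concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The toy logistic models, the squared-error form of both terms, the weight `alpha`, and every function name are assumptions chosen for illustration. The point it shows is the one the abstract makes: a student can match the teacher's output on an instance while still responding differently to changes in the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Toy model (assumption): positive-class probability sigmoid(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def input_gradient(w, x):
    """Analytic gradient of predict(w, x) with respect to the input x."""
    p = predict(w, x)
    return [p * (1.0 - p) * wi for wi in w]


def gkd_style_loss(w_teacher, w_student, x, alpha=1.0):
    """Output-matching term plus an input-gradient alignment term.

    Both terms use squared error here, which is an illustrative choice
    and not necessarily the objective used in the paper.
    """
    output_term = (predict(w_student, x) - predict(w_teacher, x)) ** 2
    g_teacher = input_gradient(w_teacher, x)
    g_student = input_gradient(w_student, x)
    gradient_term = sum(
        (gs - gt) ** 2 for gs, gt in zip(g_student, g_teacher)
    ) / len(x)
    return output_term + alpha * gradient_term


# At x = [0, 0] both toy models output 0.5, so output matching alone
# sees no difference; only the gradient term separates them.
teacher_w = [1.0, -2.0]
student_w = [0.0, 0.0]
x = [0.0, 0.0]
loss_output_only = gkd_style_loss(teacher_w, student_w, x, alpha=0.0)
loss_with_gradient = gkd_style_loss(teacher_w, student_w, x, alpha=1.0)
```

In this toy setting `loss_output_only` is zero while `loss_with_gradient` is positive, which is the kind of signal an instance-wise output objective would miss.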


Code & Models

Repositories

lancopku/gkd (PyTorch, official)


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Knowledge Distillation