GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model
Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang,, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie, Tang

TL;DR
GKD is a versatile framework enabling efficient knowledge distillation of extremely large-scale pre-trained language models, supporting multiple methods and facilitating deployment on memory-limited hardware.
Contribution
The paper introduces GKD, a flexible framework that allows large-scale PLM distillation with various methods on limited GPU memory, supporting models over 100B parameters.
Findings
Supports distillation of 100B-scale PLMs
Enables use of 25 distillation methods
Operates on 8 NVIDIA A100 GPUs
Abstract
Currently, the reduction in the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their widespread deployment on various devices. However, the deployment of knowledge distillation systems faces great challenges in real-world industrial-strength applications, which require the use of complex distillation methods on even larger-scale PLMs (over 10B), limited by memory on GPUs and the switching of methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation on larger-scale PLMs using various distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of at least…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Topic Modeling · Ferroelectric and Negative Capacitance Devices
MethodsKnowledge Distillation
