TL;DR
This paper introduces Patient Knowledge Distillation (PKD), which compresses a large BERT model (teacher) into a smaller, shallower model (student). The student learns from multiple intermediate teacher layers rather than only the final output, which improves training efficiency without sacrificing accuracy.
Contribution
It proposes two new multi-layer distillation strategies, PKD-Last and PKD-Skip, that leverage rich intermediate representations for better model compression.
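As a rough illustration of how the two strategies differ, the sketch below picks which teacher layers a student would be matched against. The function name `pkd_teacher_layers`, the 1-indexed layer convention, and the choice to leave the student's final layer unmatched (on the assumption that it is covered by ordinary output distillation) are assumptions made for this example and are not stated on this page.

```python
from typing import List


def pkd_teacher_layers(teacher_layers: int, student_layers: int, strategy: str) -> List[int]:
    """Return 1-indexed teacher layers matched to student layers 1..student_layers-1.

    Hypothetical helper: the student's final layer is assumed to be handled
    by standard output distillation, so it gets no intermediate match.
    """
    n_matched = student_layers - 1
    if strategy == "last":
        # PKD-Last: the n_matched layers just below the teacher's final layer.
        start = teacher_layers - n_matched
        return list(range(start, teacher_layers))
    if strategy == "skip":
        # PKD-Skip: one teacher layer every `step` layers.
        step = teacher_layers // student_layers
        return [step * j for j in range(1, n_matched + 1)]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    print(pkd_teacher_layers(12, 6, "last"))
    print(pkd_teacher_layers(12, 6, "skip"))
```

Under these assumptions, PKD-Last concentrates supervision near the top of the teacher, while PKD-Skip spreads it evenly across the teacher's depth.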
Findings
- Improved NLP task performance with smaller models
- Significant training efficiency gains
- Maintained accuracy in compressed models
Abstract
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: (i) PKD-Last: learning from the last k layers; and (ii) PKD-Skip: learning from every k layers. These two patient…
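The abstract excerpt does not spell out how the student "learns from" an intermediate teacher layer. One plausible reading, sketched below, penalizes the squared distance between L2-normalized hidden vectors of each matched student/teacher layer pair. The names `l2_normalize` and `patient_loss`, the use of one vector per layer, and the normalization step are all assumptions for illustration, not the authors' confirmed formulation.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def patient_loss(
    student_hidden: Sequence[Sequence[float]],
    teacher_hidden: Sequence[Sequence[float]],
    layer_map: Sequence[int],
) -> float:
    """Sum of squared distances between normalized student and teacher vectors.

    student_hidden[k] / teacher_hidden[k] hold the vector for layer k+1.
    layer_map[j-1] is the 1-indexed teacher layer matched to student layer j
    (a hypothetical mapping supplied by the caller).
    """
    total = 0.0
    for j, t_layer in enumerate(layer_map, start=1):
        s = l2_normalize(student_hidden[j - 1])
        t = l2_normalize(teacher_hidden[t_layer - 1])
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total


if __name__ == "__main__":
    student = [[1.0, 0.0]]
    teacher = [[0.0, 3.0], [2.0, 0.0]]
    print(patient_loss(student, teacher, [2]))  # same direction after normalization
    print(patient_loss(student, teacher, [1]))  # orthogonal directions
```

Because both vectors are normalized first, this sketch only penalizes differences in direction, not in magnitude; in practice such a term would be added to the usual task and output-distillation losses with a weighting factor.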
Code & Models
- intersun/PKD-for-BERT-Model-Compression (PyTorch, official)
- Daniel-H-99/Patient-Knowledge-Distillation (PyTorch)
- eunanomist/PKD_BERT (PyTorch)
- MindSpore-scientific/code-11/tree/main/Patient2Vec-A-Personalized-Interpretable (MindSpore)
- MindSpore-scientific/code-13/tree/main/Patient2Vec-A-Personalized-Interpretable (MindSpore)
Taxonomy
Methods: Linear Layer · Knowledge Distillation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece
