Patient Knowledge Distillation for BERT Model Compression

Siqi Sun; Yu Cheng; Zhe Gan; Jingjing Liu

arXiv:1908.09355 · cs.CL · August 27, 2019

TL;DR

This paper introduces Patient Knowledge Distillation, which compresses a large BERT teacher into a shallower student by having the student learn from multiple intermediate teacher layers rather than only the final layer's output. The authors report gains in training efficiency without sacrificing accuracy.

Contribution

It proposes two multi-layer distillation strategies that exploit the teacher's intermediate representations: PKD-Last, in which the student learns from the last $k$ teacher layers, and PKD-Skip, in which it learns from every $k$ teacher layers.
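
The difference between the two strategies comes down to which teacher layers the student is paired with. The following is a minimal sketch of that selection, not the authors' code. It assumes a student of depth `student_layers` whose intermediate layers (all but the last) each imitate one teacher layer, and that the teacher's final layer is left to ordinary output distillation; the function name `pkd_teacher_layers` and its parameters are hypothetical.

```python
from typing import List


def pkd_teacher_layers(teacher_layers: int, student_layers: int, strategy: str) -> List[int]:
    """Return the 1-indexed teacher layers imitated by student layers 1..student_layers-1.

    Assumption: the student's last layer is handled by standard output
    distillation, so only student_layers - 1 teacher layers are selected.
    """
    if not 2 <= student_layers <= teacher_layers:
        raise ValueError("need 2 <= student_layers <= teacher_layers")
    n = student_layers - 1  # number of intermediate student layers

    if strategy == "last":
        # PKD-Last: the n layers directly below the teacher's final layer.
        return list(range(teacher_layers - n, teacher_layers))
    if strategy == "skip":
        # PKD-Skip: every k-th teacher layer, with k set by the depth ratio.
        k = teacher_layers // student_layers
        return [k * i for i in range(1, n + 1)]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    print(pkd_teacher_layers(12, 6, "last"))  # [7, 8, 9, 10, 11]
    print(pkd_teacher_layers(12, 6, "skip"))  # [2, 4, 6, 8, 10]
```

Under these assumptions, PKD-Last concentrates supervision near the top of the teacher, while PKD-Skip spreads it evenly across the teacher's depth.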

Findings

1. Improved NLP task performance with smaller models
2. Significant training efficiency gains
3. Maintained accuracy in compressed models

Abstract

Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: (i) PKD-Last: learning from the last $k$ layers; and (ii) PKD-Skip: learning from every $k$ layers. These two patient…
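
The truncated abstract does not spell out how the student "learns from" an intermediate teacher layer. One common way to express this kind of layer-wise imitation is a squared-error penalty between L2-normalized hidden states of paired layers, summed over the pairs. The sketch below illustrates that idea under stated assumptions: each layer is summarized by a single vector (for example, the hidden state of a classification token), student and teacher vectors share the same dimension, and the names `l2_normalize` and `patient_loss` are hypothetical. It is written in plain Python for readability rather than speed.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def patient_loss(
    student_states: Sequence[Sequence[float]],
    teacher_states: Sequence[Sequence[float]],
) -> float:
    """Sum of squared distances between normalized, layer-paired hidden states.

    student_states[j] is paired with teacher_states[j]; the pairing itself
    (e.g. a last-k or every-k selection) is assumed to be done by the caller.
    """
    if len(student_states) != len(teacher_states):
        raise ValueError("each student layer needs exactly one teacher layer")
    total = 0.0
    for h_s, h_t in zip(student_states, teacher_states):
        if len(h_s) != len(h_t):
            raise ValueError("paired hidden states must have the same dimension")
        s, t = l2_normalize(h_s), l2_normalize(h_t)
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total


if __name__ == "__main__":
    student = [[3.0, 4.0], [1.0, 0.0]]
    teacher = [[6.0, 8.0], [0.0, 1.0]]
    # First pair differs only in scale; second pair is orthogonal.
    print(patient_loss(student, teacher))
```

Normalizing before comparing makes the penalty depend on the direction of the hidden states rather than their magnitude, so a student layer is not penalized merely for producing activations at a different scale. In a full training setup, a term like this would typically be added to the usual task and output-distillation losses with a weighting coefficient.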

Code & Models

Models

hadangvu/pkd-albert-student · model · 8 downloads

Taxonomy

Methods: Linear Layer · Knowledge Distillation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece