MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models
Ying Zhang, Ziheng Yang, Shufan Ji

TL;DR
MLKD-BERT introduces a multi-level knowledge distillation approach that enhances BERT compression by exploring relation-level knowledge and flexible attention head settings, leading to improved performance and reduced inference time.
Contribution
The paper proposes MLKD-BERT, a novel multi-level knowledge distillation method that improves BERT compression by incorporating relation-level knowledge and flexible attention head configurations.
Findings
Outperforms state-of-the-art distillation methods on GLUE and QA tasks.
Enables flexible adjustment of attention heads with minimal performance loss.
Reduces inference time significantly while maintaining high accuracy.
Abstract
Knowledge distillation is an effective technique for compressing pre-trained language models. Although existing knowledge distillation methods perform well on BERT, the most typical such model, they can be improved in two respects: relation-level knowledge can be explored further to improve model performance, and the number of student attention heads can be set more flexibly to reduce inference time. We therefore propose MLKD-BERT, a novel knowledge distillation method that distills multi-level knowledge in a teacher-student framework. Extensive experiments on the GLUE benchmark and extractive question answering tasks demonstrate that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set the number of student attention heads, allowing a substantial decrease in inference time with little performance drop.
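The abstract does not spell out the loss functions. As background only, the soft-label loss commonly used at the prediction level in teacher-student distillation can be sketched as below. This is the generic temperature-scaled formulation, not MLKD-BERT's multi-level objective; the function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    Generic soft-label distillation loss (an assumption for illustration,
    not taken from the MLKD-BERT paper). The T^2 factor is the usual
    rescaling that keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class task.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
loss = soft_label_kd_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two softened distributions diverge.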
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Weight Decay · Residual Connection · Multi-Head Attention · WordPiece · Softmax · Layer Normalization
