MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

Ying Zhang; Ziheng Yang; Shufan Ji

arXiv:2407.02775·cs.CL·July 4, 2024

TL;DR

MLKD-BERT is a multi-level knowledge distillation method for compressing BERT. It distills relation-level knowledge to improve student accuracy and lets the student use a flexible number of attention heads to cut inference time.

Contribution

The paper proposes MLKD-BERT, a knowledge distillation method in the teacher-student framework that distills knowledge at multiple levels, including relation-level knowledge, and does not fix the student's attention head number.

Findings

01 Outperforms state-of-the-art knowledge distillation methods for BERT on the GLUE benchmark and on extractive question answering tasks.

02 Allows the student's attention head number to be set flexibly with little performance drop.

03 Substantially decreases inference time when the student uses fewer attention heads.

Abstract

Knowledge distillation is an effective technique for compressing pre-trained language models. Although existing knowledge distillation methods perform well for BERT, the most typical such model, they could be improved in two respects: relation-level knowledge could be explored further to improve model performance, and the student's attention head number could be set more flexibly to decrease inference time. We are therefore motivated to propose MLKD-BERT, a novel knowledge distillation method that distills multi-level knowledge in a teacher-student framework. Extensive experiments on the GLUE benchmark and on extractive question answering tasks demonstrate that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set the student's attention head number, allowing a substantial decrease in inference time with little performance drop.
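The abstract names two ingredients without giving formulas: distillation in a teacher-student framework and relation-level knowledge. The sketch below is a generic illustration of both ideas, not MLKD-BERT's actual objective, which this page does not specify. It shows the standard temperature-softened soft-label loss, plus one common way to express relations between tokens as a similarity matrix whose shape depends only on sequence length, so a narrower student can be compared with a wider teacher. All function names, the temperature value, and the toy vectors are assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def soft_label_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the generic soft-label distillation loss, assumed here
    only as a stand-in for a prediction-level term.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def relation_matrix(vectors):
    """Row-normalised token-to-token similarities (scaled dot product).

    The result is n x n for n tokens, whatever the vector width.
    """
    dim = len(vectors[0])
    rows = []
    for u in vectors:
        scores = [sum(a * b for a, b in zip(u, v)) / math.sqrt(dim)
                  for v in vectors]
        rows.append([math.exp(lp) for lp in log_softmax(scores)])
    return rows


def relation_kd_loss(teacher_rel, student_rel):
    """Mean row-wise KL(teacher || student) between two relation matrices."""
    total = 0.0
    for p_row, q_row in zip(teacher_rel, student_rel):
        total += sum(p * math.log(p / q) for p, q in zip(p_row, q_row))
    return total / len(teacher_rel)


if __name__ == "__main__":
    # Toy example: 2 tokens, teacher width 4, student width 2 (hypothetical).
    teacher_tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
    student_tokens = [[1.0, 0.0], [0.0, 1.0]]
    t_rel = relation_matrix(teacher_tokens)
    s_rel = relation_matrix(student_tokens)
    print(relation_kd_loss(t_rel, s_rel))
    print(soft_label_kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Because both relation matrices are 2 x 2 in the toy example, the loss is defined even though the teacher and student vectors have different widths. That is the property that makes relation-style targets convenient when the student architecture differs from the teacher's.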

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Weight Decay · Residual Connection · Multi-Head Attention · WordPiece · Softmax · Layer Normalization