MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models
Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong

TL;DR
This paper introduces MKD, a multi-task knowledge distillation framework that enhances lightweight language models by jointly distilling multiple tasks, improving generalization and efficiency across different architectures.
Contribution
It proposes a general, model-agnostic multi-task distillation method applicable to various architectures, outperforming task-specific approaches in efficiency and effectiveness.
Findings
Achieves better performance than similar LSTM-based methods under the same constraints.
Reaches comparable results to state-of-the-art with faster inference.
Applicable to Transformer and LSTM models.
Abstract
Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources to train such models remain an issue. Knowledge distillation alleviates this problem by learning a light-weight student model. So far the distillation approaches are all task-specific. In this paper, we explore knowledge distillation under the multi-task learning setting. The student is jointly distilled across different tasks. It acquires more general representation capacity through multi-task distillation and can be further fine-tuned to improve the model in the target domain. Unlike other BERT distillation methods, which are specifically designed for Transformer-based architectures, we provide a general learning framework. Our approach is model agnostic and can be easily applied to different future teacher model architectures. We evaluate our approach on a…
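To make the idea of a student being "jointly distilled across different tasks" concrete, here is a minimal sketch of a multi-task distillation objective. The abstract does not specify the loss, so everything below is an assumption: it uses the common temperature-softened KL divergence between teacher and student outputs, averaged over tasks. The function names, the task names, and the temperature value are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a generic soft-target loss, not necessarily the paper's.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multitask_distillation_loss(batches, temperature=2.0):
    """Average the per-task distillation loss over all tasks.

    `batches` maps a (hypothetical) task name to a list of
    (student_logits, teacher_logits) pairs for that task.
    Returns the overall mean and the per-task means.
    """
    per_task = {}
    for task, pairs in batches.items():
        losses = [distillation_loss(s, t, temperature) for s, t in pairs]
        per_task[task] = sum(losses) / len(losses)
    total = sum(per_task.values()) / len(per_task)
    return total, per_task


# Illustrative logits only; these numbers are made up.
example = {
    "task_a": [([0.2, 1.1], [0.0, 2.0])],
    "task_b": [([1.5, 0.3, 0.1], [2.0, 0.5, 0.0])],
}
total_loss, task_losses = multitask_distillation_loss(example)
```

Because the loss only compares output logits, this sketch makes no assumption about the internals of either model, which matches the model-agnostic framing in the abstract. Weighting every task equally is a simplification chosen here for brevity.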
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Knowledge Distillation · Sigmoid Activation · Tanh Activation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections
