Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao

TL;DR
This paper applies knowledge distillation to multi-task deep neural networks for natural language understanding, significantly improving performance on multiple GLUE benchmark tasks by training a single model to emulate ensemble teachers.
Contribution
It introduces a multi-task knowledge distillation approach in which a single MT-DNN student is trained to match task-specific ensemble teachers, outperforming the original MT-DNN on 7 of 9 GLUE tasks.
Findings
The distilled MT-DNN outperforms the original MT-DNN on 7 of 9 GLUE tasks.
It reaches 83.7% on the GLUE benchmark as a single model, a 1.5% absolute improvement.
Code and models will be publicly available.
Abstract
This paper explores the use of knowledge distillation to improve a Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019) for learning text representations across multiple natural language understanding tasks. Although ensemble learning can improve model performance, serving an ensemble of large DNNs such as MT-DNN can be prohibitively expensive. Here we apply the knowledge distillation method (Hinton et al., 2015) in the multi-task learning setting. For each task, we train an ensemble of different MT-DNNs (teacher) that outperforms any single model, and then train a single MT-DNN (student) via multi-task learning to distill knowledge from these ensemble teachers. We show that the distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9 GLUE tasks, pushing the GLUE benchmark (single model) to 83.7% (1.5% absolute improvement, based on the GLUE…
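The teacher-student setup described in the abstract can be illustrated with a small sketch. It assumes the common soft-target formulation from Hinton et al. (2015): the ensemble's class probabilities are averaged and the student is trained with cross-entropy against that average. The function names, the temperature parameter, and the alpha weighting between soft and hard targets are hypothetical choices made for illustration, not details confirmed by this summary.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(teacher_logits, temperature=1.0):
    """Average the class probabilities predicted by each teacher."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]

def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Cross-entropy between a target distribution and the student's softmax."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

def distillation_loss(teacher_logits, student_logits, hard_label=None, alpha=1.0):
    """Soft-target loss, optionally mixed with a hard-label loss.

    alpha is a hypothetical mixing weight (assumption, not from the paper).
    """
    soft = ensemble_soft_targets(teacher_logits)
    loss = alpha * cross_entropy(soft, student_logits)
    if hard_label is not None:
        one_hot = [1.0 if k == hard_label else 0.0
                   for k in range(len(student_logits))]
        loss += (1.0 - alpha) * cross_entropy(one_hot, student_logits)
    return loss

# Two teachers that disagree on a two-class example.
teachers = [[2.0, 0.0], [0.0, 2.0]]
soft = ensemble_soft_targets(teachers)
loss_matched = distillation_loss(teachers, [0.0, 0.0])
loss_overconfident = distillation_loss(teachers, [3.0, 0.0])
```

In this toy case the two teachers disagree, so the averaged target is an even split; a student that is equally unsure incurs a lower loss than one that commits strongly to a single class. That extra information in the teachers' full distributions is what a single hard label would not carry.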
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
