MixKD: Towards Efficient Distillation of Large-scale Language Models

Kevin J Liang; Weituo Hao; Dinghan Shen; Yufan Zhou; Weizhu Chen; Changyou Chen; Lawrence Carin

arXiv:2011.00593 · cs.CL · March 18, 2021 · 30 cites

TL;DR

MixKD is a data-agnostic knowledge distillation framework that uses mixup augmentation: the student is trained to mimic the teacher on interpolated data as well, which improves the generalization of the compressed model, especially when task-specific data is limited.

Contribution

The paper proposes MixKD, a distillation framework that incorporates mixup augmentation into knowledge distillation to improve the student's generalization and to mitigate data scarcity when compressing large-scale language models.

Findings

01. MixKD outperforms standard knowledge distillation on the GLUE benchmark.

02. MixKD shows significant gains in limited-data settings.

03. Ablation studies confirm the effectiveness of mixup in distillation.

Abstract

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation…
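The core idea, having the student match the teacher on interpolated inputs, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy linear "models", the helper names (`mixup`, `linear_model`, `mixup_distillation_loss`), the Beta(0.4, 0.4) mixing distribution, and the use of a KL-divergence distillation term are all assumptions made for illustration, and the paper's full objective is not given in this excerpt.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixup(x_i, x_j, lam):
    """Linearly interpolate two equal-length vectors (e.g. input embeddings)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def linear_model(weights):
    """Hypothetical stand-in for a network: maps an input vector to logits."""
    def forward(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return forward


def mixup_distillation_loss(teacher, student, x_i, x_j, lam, temperature=1.0):
    """Distillation term on one interpolated input (assumed KL form)."""
    x_mix = mixup(x_i, x_j, lam)
    p_teacher = softmax(teacher(x_mix), temperature)
    p_student = softmax(student(x_mix), temperature)
    return kl_divergence(p_teacher, p_student)


# Toy two-class example with made-up weights and inputs.
teacher = linear_model([[1.0, -0.5], [-1.0, 0.8]])
student = linear_model([[0.3, 0.1], [-0.2, 0.4]])
x_i, x_j = [1.0, 0.0], [0.0, 1.0]

# Mixing coefficient drawn from Beta(alpha, alpha); alpha = 0.4 is an assumption.
lam = random.betavariate(0.4, 0.4)
loss = mixup_distillation_loss(teacher, student, x_i, x_j, lam)
print(f"lambda={lam:.3f}  mixup distillation loss={loss:.4f}")
```

In a real training loop this term would typically be added to the usual losses on the original examples, so that the teacher supplies soft targets on the interpolated inputs, for which no task labels exist.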

Videos

MixKD: Towards Efficient Distillation of Large-scale Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Knowledge Distillation