MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wenhui Wang; Furu Wei; Li Dong; Hangbo Bao; Nan Yang; Ming Zhou

arXiv:2002.10957 · cs.CL · April 7, 2020 · 632 cites

TL;DR

This paper introduces a novel deep self-attention distillation method to compress large pre-trained Transformer models into smaller, efficient models while maintaining high accuracy across NLP tasks.

Contribution

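It proposes distilling the self-attention module of the teacher's last Transformer layer into the student. Beyond the attention distributions (the scaled dot-product of queries and keys), it also transfers the scaled dot-product between values as additional self-attention knowledge, and it demonstrates improved performance with smaller models.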
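The idea can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: the function names, the `eps` smoothing term, and the toy inputs are assumptions made for illustration, and it assumes the loss is a KL divergence between teacher and student relation matrices, averaged over sequence positions. Because both relation matrices have shape sequence length by sequence length, the student's hidden size does not need to match the teacher's.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_relation(a, b):
    """Row-wise softmax of (a @ b^T) / sqrt(d) for a, b shaped [seq_len][d]."""
    d = len(a[0])
    return [
        softmax([sum(x * y for x, y in zip(ra, rb)) / math.sqrt(d) for rb in b])
        for ra in a
    ]

def mean_row_kl(teacher, student, eps=1e-12):
    """KL(teacher || student), averaged over rows (sequence positions).

    eps is an illustrative smoothing term (an assumption, not from the paper).
    """
    total = 0.0
    for p_row, q_row in zip(teacher, student):
        total += sum(p * math.log((p + eps) / (q + eps))
                     for p, q in zip(p_row, q_row))
    return total / len(teacher)

def self_attention_distillation_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    """Sum of the attention-distribution term and the value-relation term."""
    attention_term = mean_row_kl(scaled_dot_relation(t_q, t_k),
                                 scaled_dot_relation(s_q, s_k))
    value_term = mean_row_kl(scaled_dot_relation(t_v, t_v),
                             scaled_dot_relation(s_v, s_v))
    return attention_term + value_term

# Toy last-layer projections for 3 tokens: teacher head size 4, student head size 2.
teacher_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
teacher_k = teacher_q
teacher_v = teacher_q
student_q = [[0.5, 0.1], [0.2, 0.7], [0.3, 0.3]]
student_k = [[0.4, 0.2], [0.1, 0.9], [0.6, 0.5]]
student_v = [[0.9, 0.0], [0.0, 0.8], [0.2, 0.2]]

loss = self_attention_distillation_loss(
    teacher_q, teacher_k, teacher_v, student_q, student_k, student_v
)
print(loss)
```

A multi-head version would compute the same two terms per attention head and average them, which requires the student to use the same number of heads as the teacher even when its hidden size is smaller.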

Findings

01

The student model retains more than 99% accuracy on SQuAD 2.0 while using 50% of the teacher's Transformer parameters and computations.

02

Achieves competitive results on GLUE benchmarks.

03

Outperforms state-of-the-art model compression baselines across different student model sizes.

Abstract

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between…

Code & Models

Repositories

microsoft/unilm (PyTorch, official)

Videos

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers · SlidesLive

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Dense Connections