MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

TL;DR
This paper introduces a novel deep self-attention distillation method to compress large pre-trained Transformer models into smaller, efficient models while maintaining high accuracy across NLP tasks.
Contribution
It proposes a new distillation approach focusing on self-attention modules, including the scaled dot-product of values, and demonstrates improved performance with smaller models.
Findings
Small models retain over 99% of the teacher model's accuracy on SQuAD 2.0.
Achieves competitive results on GLUE benchmarks.
Outperforms state-of-the-art baselines in model compression.
Abstract
Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between…
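The idea of having the student mimic the teacher's last-layer self-attention can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, random inputs, and the use of KL divergence as the mismatch measure are all assumptions made for the example, and the additional scaled dot-product term mentioned in the abstract is not shown because the excerpt above is truncated.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_distribution(queries, keys):
    """Scaled dot-product attention weights.

    queries, keys: arrays of shape (heads, seq_len, head_dim).
    Returns an array of shape (heads, seq_len, seq_len).
    """
    head_dim = queries.shape[-1]
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(head_dim)
    return softmax(scores, axis=-1)

def attention_distillation_loss(teacher_attn, student_attn, eps=1e-12):
    """Mean KL(teacher || student) over heads and query positions.

    KL divergence is an assumption of this sketch.
    """
    kl = teacher_attn * (np.log(teacher_attn + eps) - np.log(student_attn + eps))
    return kl.sum(axis=-1).mean()

# Hypothetical last-layer queries and keys; random data stands in for real activations.
rng = np.random.default_rng(0)
heads, seq_len = 12, 8
teacher_q = rng.normal(size=(heads, seq_len, 64))  # teacher head_dim = 64 (assumed)
teacher_k = rng.normal(size=(heads, seq_len, 64))
student_q = rng.normal(size=(heads, seq_len, 32))  # student head_dim = 32 (assumed)
student_k = rng.normal(size=(heads, seq_len, 32))

teacher_attn = attention_distribution(teacher_q, teacher_k)
student_attn = attention_distribution(student_q, student_k)

loss = attention_distillation_loss(teacher_attn, student_attn)
self_loss = attention_distillation_loss(teacher_attn, teacher_attn)
print(f"distillation loss: {loss:.4f}")
```

Note that the attention matrices have shape (heads, seq_len, seq_len) regardless of the per-head dimension, so in this sketch the teacher and student can use different hidden sizes as long as the number of heads matches. Only the last layer is compared, so the student's depth is likewise unconstrained here.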
Code & Models
- 🤗 microsoft/MiniLM-L12-H384-uncased · model · 25k dl · ♡ 107
- 🤗 microsoft/Multilingual-MiniLM-L12-H384 · model · 60k dl · ♡ 100
- 🤗 C5i/SEAD-L-6_H-256_A-8-sst2 · model · 6 dl
- 🤗 C5i/SEAD-L-6_H-384_A-12-sst2 · model · 10 dl
- 🤗 BM-K/KoMiniLM · model · 99 dl · ♡ 4
- 🤗 C5i/SEAD-L-6_H-384_A-12-mrpc · model · 3 dl
- 🤗 C5i/SEAD-L-6_H-256_A-8-mrpc · model · 2 dl
- 🤗 C5i/SEAD-L-6_H-256_A-8-rte · model · 2 dl
- 🤗 C5i/SEAD-L-6_H-384_A-12-rte · model · 3 dl
- 🤗 C5i/SEAD-L-6_H-256_A-8-stsb · model · 1 dl
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Dense Connections
