TinyBERT: Distilling BERT for Natural Language Understanding
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu

TL;DR
TinyBERT is a compact, efficient version of BERT created through a novel Transformer knowledge distillation method and a two-stage learning framework, achieving high accuracy with significantly reduced size and faster inference.
Contribution
The paper introduces a new Transformer distillation technique and a two-stage training process specifically designed for creating small, high-performance BERT variants.
Findings
The 4-layer TinyBERT retains more than 96.8% of BERT-Base performance on GLUE.
The 4-layer TinyBERT is 7.5x smaller and 9.4x faster at inference than BERT-Base.
The 6-layer TinyBERT performs on par with BERT-Base.
Abstract
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the wealth of knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as…
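The abstract excerpt above does not spell out the distillation objective. As a rough sketch, assuming the layer-wise losses described in the full paper (mean squared error on attention matrices and on linearly projected hidden states, plus a soft cross-entropy on output logits) and a uniform student-to-teacher layer mapping, the pieces might look like the following. All function names and the pure-Python list representation are illustrative assumptions, not the authors' code.

```python
import math


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def matmul(a, b):
    """Plain matrix product of two 2-D lists (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def uniform_layer_map(student_layer, n_student, n_teacher):
    """Map 1-based student layer m to a teacher layer.

    Assumes n_teacher is a multiple of n_student (e.g. 4 and 12).
    """
    return student_layer * (n_teacher // n_student)


def attention_loss(student_heads, teacher_heads):
    """Average MSE over attention heads; each head is a 2-D attention matrix."""
    per_head = [mse(s, t) for s, t in zip(student_heads, teacher_heads)]
    return sum(per_head) / len(per_head)


def hidden_loss(student_hidden, teacher_hidden, projection):
    """MSE after projecting student states into the teacher's hidden size.

    `projection` stands in for a learned matrix; here it is just an input.
    """
    return mse(matmul(student_hidden, projection), teacher_hidden)


def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between teacher and student output distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
```

With a 4-layer student and a 12-layer teacher, `uniform_layer_map` pairs student layers 1 to 4 with teacher layers 3, 6, 9 and 12. In practice these losses would be computed on tensors in a deep learning framework so that the projection matrix can be trained; plain lists are used here only to keep the sketch dependency-free.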
Code & Models
- 🤗 huawei-noah/TinyBERT_General_4L_312D · model · 75k dl · ♡ 76
- 🤗 DataikuNLP/TinyBERT_General_4L_312D · model · 33 dl · ♡ 1
- 🤗 deepset/tinybert-6l-768d-squad2 · model · 628 dl · ♡ 2
- 🤗 deepset/tinyroberta-6l-768d · model · 80 dl · ♡ 3
- 🤗 deepset/tinyroberta-squad2 · model · 120k dl · ♡ 113
- 🤗 dvm1983/TinyBERT_General_4L_312D_de · model · 29 dl · ♡ 3
- 🤗 C5i/SEAD-L-6_H-256_A-8-sst2 · model · 6 dl
- 🤗 C5i/SEAD-L-6_H-384_A-12-sst2 · model · 10 dl
- 🤗 C5i/SEAD-L-6_H-384_A-12-mrpc · model · 3 dl
- 🤗 C5i/SEAD-L-6_H-256_A-8-mrpc · model · 2 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Knowledge Distillation · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding
