TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao; Yichun Yin; Lifeng Shang; Xin Jiang; Xiao Chen; Linlin Li; Fang Wang; Qun Liu

arXiv:1909.10351 · cs.CL · October 19, 2020 · 136 cites

TL;DR

TinyBERT is a compact, efficient version of BERT created through a novel Transformer knowledge distillation method and a two-stage learning framework, achieving high accuracy with significantly reduced size and faster inference.

Contribution

The paper introduces a new Transformer distillation technique and a two-stage training process specifically designed for creating small, high-performance BERT variants.

Findings

01

TinyBERT with 4 layers retains more than 96.8% of the performance of its teacher, BERT-Base, on the GLUE benchmark.

02

The 4-layer TinyBERT is 7.5x smaller and 9.4x faster at inference than BERT-Base.

03

TinyBERT with 6 layers performs on par with its teacher, BERT-Base.

Abstract

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to run them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as…
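As a rough illustration of the teacher-to-student transfer the abstract describes, here is a minimal sketch of a generic knowledge distillation loss. It is not the paper's exact Transformer distillation objective: the function names, the `temperature` and `hidden_weight` parameters, and the assumption that student and teacher hidden vectors share the same dimension are all choices made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def mse(student_vec, teacher_vec):
    """Mean squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=1.0, hidden_weight=1.0):
    """Hypothetical combined loss: match the teacher's predictions and one hidden vector.

    Assumes (for this sketch only) that the hidden vectors already have equal size.
    """
    prediction_term = soft_cross_entropy(student_logits, teacher_logits, temperature)
    hidden_term = mse(student_hidden, teacher_hidden)
    return prediction_term + hidden_weight * hidden_term


if __name__ == "__main__":
    teacher_logits = [2.0, 1.0, 0.1]
    teacher_hidden = [0.5, -0.5]
    # A student that copies the teacher versus one that disagrees with it.
    print(distillation_loss([2.0, 1.0, 0.1], teacher_logits, [0.5, -0.5], teacher_hidden))
    print(distillation_loss([0.1, 1.0, 2.0], teacher_logits, [0.0, 0.0], teacher_hidden))
```

The loss is smallest when the student reproduces the teacher's outputs, so minimizing it pushes a small model toward the behavior of a large one. In a real training loop these terms would be computed on tensors by an autodiff framework rather than on Python lists.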

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Knowledge Distillation · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding