Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

TL;DR
Pre-training remains important even for small models: simply pre-training and fine-tuning a compact model is competitive with more elaborate compression methods, and adding knowledge distillation from a large fine-tuned model improves accuracy further. The result is a simple recipe for compact NLP models.
Contribution
The paper shows that pre-training matters for small models and introduces Pre-trained Distillation, a straightforward method that starts from a pre-trained compact model and then applies standard knowledge distillation from a large fine-tuned model.
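The distillation step can be illustrated with a minimal sketch. It assumes the common soft-label formulation, in which the student is trained to minimize the cross-entropy between the teacher's predicted distribution and its own. The function names and the `temperature` parameter are illustrative choices, not taken from the paper's code.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between teacher soft labels and student predictions.

    Assumption: soft labels only, with no hard-label term mixed in.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


if __name__ == "__main__":
    teacher = [2.0, 0.0]
    print(distillation_loss(teacher, [2.0, 0.0]))  # student agrees with teacher
    print(distillation_loss(teacher, [0.0, 2.0]))  # student disagrees: larger loss
```

For a fixed teacher, the loss is smallest when the student reproduces the teacher's distribution, which is what pushes the compact model toward the large model's behavior on the task.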
Findings
Pre-training small models remains important for performance.
Knowledge distillation from large models improves compact model accuracy.
Model size and data properties have a combined effect on results.
Abstract
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting…
Code & Models
- 🤗 google/bert_uncased_L-4_H-256_A-4 (model) · 49k downloads · ♡ 14
- 🤗 prajjwal1/bert-tiny (model) · 769k downloads · ♡ 140
- 🤗 aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616 (model) · 1 download
- 🤗 aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616_squad2 (model) · 4 downloads
- 🤗 aodiniz/bert_uncased_L-2_H-512_A-8_cord19-200616 (model) · 1 download
- 🤗 aodiniz/bert_uncased_L-4_H-256_A-4_cord19-200616 (model) · 1 download
- 🤗 dbmdz/bert-medium-historic-multilingual-cased (model) · 219 downloads
- 🤗 dbmdz/bert-mini-historic-multilingual-cased (model) · 53 downloads · ♡ 3
- 🤗 dbmdz/bert-small-historic-multilingual-cased (model) · 8 downloads · ♡ 1
- 🤗 dbmdz/bert-tiny-historic-multilingual-cased (model) · 54 downloads · ♡ 1
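Several of the checkpoint names above encode the architecture in an `L-…_H-…_A-…` suffix. The sketch below reads that suffix as number of layers, hidden size, and number of attention heads, which is the convention used for the released compact BERT models; applying it to third-party fine-tuned checkpoints is an assumption. The helper `parse_shape` and the `CompactBertShape` type are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class CompactBertShape(NamedTuple):
    layers: int
    hidden_size: int
    attention_heads: int


# Matches the "L-<layers>_H-<hidden>_A-<heads>" fragment of a model id.
_SHAPE_PATTERN = re.compile(r"L-(\d+)_H-(\d+)_A-(\d+)")


def parse_shape(model_id: str) -> CompactBertShape:
    """Extract the architecture encoded in a model id, if present."""
    match = _SHAPE_PATTERN.search(model_id)
    if match is None:
        raise ValueError(f"no L/H/A shape found in {model_id!r}")
    layers, hidden_size, attention_heads = (int(g) for g in match.groups())
    return CompactBertShape(layers, hidden_size, attention_heads)


if __name__ == "__main__":
    print(parse_shape("google/bert_uncased_L-4_H-256_A-4"))
    print(parse_shape("aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616_squad2"))
```

Names without the suffix, such as `prajjwal1/bert-tiny`, raise `ValueError` here; their sizes have to be looked up on the model card instead.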
Taxonomy
Topics: Topic Modeling · Educational Assessment and Pedagogy · Education and Critical Thinking Development
Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax
