Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

Iulia Turc; Ming-Wei Chang; Kenton Lee; Kristina Toutanova

arXiv:1908.08962 · cs.CL · September 27, 2019 · 428 cites

TL;DR

This paper shows that pre-training remains important for small models, that fine-tuned pre-trained compact models are competitive with more elaborate compression methods, and that adding knowledge distillation from large models improves them further, giving a simple and effective recipe for compact NLP models.

Contribution

The paper establishes that pre-training matters for small models and introduces Pre-trained Distillation, a straightforward method that combines pre-training with knowledge distillation to produce better compact models.

Findings

1. Pre-training small models remains important for performance.
2. Knowledge distillation from large models improves compact model accuracy.
3. Model size and data properties have a combined effect on results.

Abstract

Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to downstream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting…
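The knowledge distillation step mentioned in the abstract can be illustrated with a small sketch. It assumes the common soft-label formulation, in which a compact student is trained to match the output distribution of a large fine-tuned teacher by minimizing cross-entropy against the teacher's probabilities. The function names (`log_softmax`, `distillation_loss`) and the `temperature` parameter are hypothetical choices for this illustration; they are not taken from the paper or its released code.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between teacher soft labels and student predictions.

    Assumption: this is the generic soft-label distillation objective,
    used here only to illustrate the idea of transferring task knowledge
    from a teacher to a student.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]          # logits from a large fine-tuned model
    matching_student = [2.0, 0.5, -1.0]  # student that agrees with the teacher
    poor_student = [-1.0, 0.5, 2.0]      # student that disagrees
    print(distillation_loss(matching_student, teacher))
    print(distillation_loss(poor_student, teacher))
```

The loss is smallest when the student's distribution equals the teacher's, so a student that disagrees with the teacher receives a larger loss than one that matches it. In practice the loss would be averaged over a batch of examples and minimized with a gradient-based optimizer.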

Taxonomy

Topics: Topic Modeling · Educational Assessment and Pedagogy · Education and Critical Thinking Development

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax