Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget
Minh Duc Bui, Fabian David Schmidt, Goran Glavaš, Katharina von der Wense

TL;DR
This paper compares knowledge distillation and pretraining from scratch for language models under fixed compute budgets, finding that advanced KD strategies outperform from-scratch training, especially with repeated data, challenging prior assumptions.
Contribution
It provides a fair experimental comparison showing that sophisticated KD methods can surpass from-scratch pretraining for smaller language models within fixed compute constraints.
Findings
Advanced KD strategies outperform from-scratch pretraining.
KD gains are larger when data is repeated under fixed compute.
Pretraining from scratch performs comparably to simple KD methods.
Abstract
Compared to standard language model (LM) pretraining (i.e., from scratch), Knowledge Distillation (KD) entails an additional forward pass through a teacher model that is typically substantially larger than the target student model. As such, KD in LM pretraining materially slows down throughput of pretraining instances vis-a-vis pretraining from scratch. Scaling laws of LM pretraining suggest that smaller models can close the gap to larger counterparts if trained on more data (i.e., processing more tokens), and under a fixed computation budget, smaller models are able to process more data than larger models. We thus hypothesize that KD might, in fact, be suboptimal to pretraining from scratch for obtaining smaller LMs, when appropriately accounting for the compute budget. To test this, we compare pretraining from scratch against several KD strategies for masked language modeling (MLM) in…
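The throughput argument in the abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes the common rule of thumb of about 6 FLOPs per parameter per token for a training step (forward plus backward) and about 2 FLOPs per parameter per token for a forward-only pass, and the budget and model sizes are hypothetical values chosen only for illustration.

```python
def tokens_within_budget(budget_flops, student_params, teacher_params=0):
    """Estimate how many tokens fit into a fixed FLOP budget.

    Assumptions (rule of thumb, not from the paper):
      - training the student costs ~6 FLOPs per parameter per token
      - a teacher forward pass costs ~2 FLOPs per parameter per token
    With teacher_params=0 this models pretraining from scratch.
    """
    flops_per_token = 6 * student_params + 2 * teacher_params
    return budget_flops / flops_per_token


# Hypothetical numbers, for illustration only.
budget = 1e18      # total compute budget in FLOPs
student = 20e6     # student parameter count
teacher = 110e6    # teacher parameter count

scratch_tokens = tokens_within_budget(budget, student)
kd_tokens = tokens_within_budget(budget, student, teacher)

# How many times more tokens the from-scratch run sees than the KD run.
ratio = scratch_tokens / kd_tokens
print(f"from scratch: {scratch_tokens:.3e} tokens")
print(f"with KD:      {kd_tokens:.3e} tokens")
print(f"ratio:        {ratio:.2f}x")
```

Under these assumed costs, the from-scratch run processes several times more tokens than the KD run for the same budget, which is the trade-off the hypothesis rests on. The real gap depends on the actual teacher and student sizes and on implementation details such as caching teacher outputs.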
Taxonomy
Topics: Big Data and Business Intelligence
Methods: Knowledge Distillation
