Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget

Minh Duc Bui; Fabian David Schmidt; Goran Glavaš; Katharina von der Wense

arXiv:2404.19319·cs.CL·May 1, 2024

TL;DR

This paper compares knowledge distillation (KD) and pretraining from scratch for language models under a fixed compute budget. It finds that advanced KD strategies outperform from-scratch training, with larger gains when data is repeated, contrary to the authors' hypothesis that KD would be suboptimal once the teacher's compute cost is accounted for.

Contribution

It provides a fair experimental comparison showing that sophisticated KD methods can surpass from-scratch pretraining for smaller language models within fixed compute constraints.

Findings

01 Advanced KD strategies outperform from-scratch pretraining.

02 KD gains are larger when data is repeated under fixed compute.

03 Pretraining from scratch performs comparably to simple KD methods.

Abstract

Compared to standard language model (LM) pretraining (i.e., from scratch), Knowledge Distillation (KD) entails an additional forward pass through a teacher model that is typically substantially larger than the target student model. As such, KD in LM pretraining materially slows down throughput of pretraining instances vis-à-vis pretraining from scratch. Scaling laws of LM pretraining suggest that smaller models can close the gap to larger counterparts if trained on more data (i.e., processing more tokens), and under a fixed computation budget, smaller models are able to process more data than larger models. We thus hypothesize that KD might, in fact, be suboptimal to pretraining from scratch for obtaining smaller LMs, when appropriately accounting for the compute budget. To test this, we compare pretraining from scratch against several KD strategies for masked language modeling (MLM) in…
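
The throughput argument in the abstract can be made concrete with a rough back-of-the-envelope sketch. It is not the paper's own accounting: it assumes the common rule of thumb of about 6 FLOPs per parameter per token for a training step (forward plus backward) and about 2 FLOPs per parameter per token for the teacher's forward-only pass. The model sizes, the budget, and the function name are hypothetical and chosen only for illustration.

```python
def tokens_within_budget(budget_flops, student_params, teacher_params=0):
    """Estimate how many tokens a student can train on within a FLOP budget.

    Assumptions (rules of thumb, not figures from the paper):
      - training the student costs ~6 FLOPs per parameter per token
      - a forward-only teacher pass costs ~2 FLOPs per parameter per token
    With teacher_params=0 this models pretraining from scratch.
    """
    flops_per_token = 6 * student_params + 2 * teacher_params
    return budget_flops / flops_per_token


# Hypothetical sizes and budget, for illustration only.
STUDENT_PARAMS = 15_000_000
TEACHER_PARAMS = 110_000_000
BUDGET_FLOPS = 1e18

scratch_tokens = tokens_within_budget(BUDGET_FLOPS, STUDENT_PARAMS)
kd_tokens = tokens_within_budget(BUDGET_FLOPS, STUDENT_PARAMS, TEACHER_PARAMS)

# How many times more tokens from-scratch training sees under the same budget.
token_ratio = scratch_tokens / kd_tokens

print(f"from scratch: {scratch_tokens:.3e} tokens")
print(f"with KD:      {kd_tokens:.3e} tokens")
print(f"ratio:        {token_ratio:.2f}x")
```

Under these assumptions the ratio depends only on the relative sizes of teacher and student, not on the budget itself. Dividing either token count by the size of a fixed corpus would give the number of passes over the data, which is the quantity behind the "repeated data" setting mentioned in the findings.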

Taxonomy

Topics: Big Data and Business Intelligence

Methods: Knowledge Distillation