Distillation Scaling Laws

Dan Busbridge; Amitis Shidani; Floris Weers; Jason Ramapuram; Etai Littwin; Russ Webb

arXiv:2502.08606 · cs.LG · July 28, 2025

TL;DR

This paper introduces a distillation scaling law that guides optimal compute allocation between teacher and student models, improving distillation efficiency and informing experimental strategies.

Contribution

The paper presents a distillation scaling law and compute-optimal distillation recipes for two scenarios: when a teacher already exists, and when a teacher must be trained first.

Findings

01

When many students are distilled or a teacher already exists, distillation outperforms supervised learning up to a compute level that scales predictably with student size.

02

Supervised learning is generally preferable when only one student is to be distilled and the teacher must also be trained.

03

The large-scale study of distillation improves understanding of the process and informs experimental design.

Abstract

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform…
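The two scenarios in the abstract differ mainly in which costs count against the compute budget. The sketch below illustrates that accounting only; it is not the paper's fitted scaling law. It assumes the common rule-of-thumb estimates of about 6·N·D FLOPs to train a model with N parameters on D tokens and about 2·N·D FLOPs for a forward pass, and it assumes teacher logits are recomputed for every student. All function names and the example model sizes are hypothetical.

```python
# Illustrative compute accounting for distillation (hypothetical helper names).
# Assumptions: training costs ~6*N*D FLOPs, a forward pass costs ~2*N*D FLOPs,
# and teacher logits are recomputed for each student rather than cached.

def train_flops(n_params, n_tokens):
    """Approximate training cost (forward + backward) in FLOPs."""
    return 6.0 * n_params * n_tokens


def forward_flops(n_params, n_tokens):
    """Approximate inference cost (forward pass only) in FLOPs."""
    return 2.0 * n_params * n_tokens


def distillation_flops_per_student(student_params, distill_tokens,
                                   teacher_params, teacher_tokens,
                                   teacher_exists, n_students=1):
    """Compute charged to one student under the stated assumptions.

    If the teacher already exists, its training cost is not charged.
    Otherwise the teacher's training cost is shared equally by all students.
    """
    student_cost = train_flops(student_params, distill_tokens)
    teacher_inference = forward_flops(teacher_params, distill_tokens)
    if teacher_exists:
        teacher_training = 0.0
    else:
        teacher_training = train_flops(teacher_params, teacher_tokens) / n_students
    return student_cost + teacher_inference + teacher_training


# Hypothetical sizes: a 1B-parameter student distilled on 20B tokens
# from a 7B-parameter teacher trained on 140B tokens.
sizes = dict(student_params=1e9, distill_tokens=2e10,
             teacher_params=7e9, teacher_tokens=1.4e11)

existing_teacher = distillation_flops_per_student(**sizes, teacher_exists=True)
new_teacher_one = distillation_flops_per_student(**sizes, teacher_exists=False)
new_teacher_ten = distillation_flops_per_student(**sizes, teacher_exists=False,
                                                 n_students=10)

print(f"existing teacher:          {existing_teacher:.3e} FLOPs")
print(f"new teacher, 1 student:    {new_teacher_one:.3e} FLOPs")
print(f"new teacher, 10 students:  {new_teacher_ten:.3e} FLOPs per student")
```

Under these assumptions, training a new teacher for a single student dominates the budget, while sharing the teacher across many students or reusing an existing one removes most of that overhead. This matches the qualitative split between the scenarios described in the abstract.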

Videos

Distillation Scaling Laws · SlidesLive

Taxonomy

Topics: Process Optimization and Integration