Distillation Scaling Laws
Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb

TL;DR
This paper introduces a distillation scaling law that guides optimal compute allocation between teacher and student models, improving distillation efficiency and informing experimental strategies.
Contribution
It presents a novel distillation scaling law and compute-optimal distillation recipes for two scenarios: when a teacher already exists and when a teacher must be trained.
Findings
With many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size.
Supervised learning is generally preferable when only one student is to be distilled and the teacher must also be trained.
The large-scale study of distillation improves understanding of the process and helps inform experimental design.
Abstract
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform…
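The difference between the two scenarios in the abstract comes down to what is charged to the compute budget. The sketch below illustrates that accounting under assumptions that are not stated in this summary: the common rule of thumb of about 6 FLOPs per parameter per token for training and about 2 for a forward pass, and the choice to spread a new teacher's training cost evenly across the students distilled from it. All function and parameter names are hypothetical, and this is not the paper's scaling law itself.

```python
def train_flops(n_params, n_tokens):
    """Approximate training cost (assumed rule of thumb: 6 FLOPs per parameter per token)."""
    return 6.0 * n_params * n_tokens


def forward_flops(n_params, n_tokens):
    """Approximate forward-pass cost (assumed rule of thumb: 2 FLOPs per parameter per token)."""
    return 2.0 * n_params * n_tokens


def distillation_flops_per_student(n_student, d_student, n_teacher, d_teacher,
                                   teacher_exists, n_students=1):
    """Compute charged to one student under the assumed accounting.

    teacher_exists: if True, teacher training is treated as already paid for.
    n_students: number of students sharing a newly trained teacher (assumption:
    the teacher's training cost is split evenly among them).
    """
    if n_students < 1:
        raise ValueError("n_students must be at least 1")
    student_training = train_flops(n_student, d_student)
    # The teacher produces logits for every token the student is distilled on.
    teacher_logits = forward_flops(n_teacher, d_student)
    teacher_training = 0.0 if teacher_exists else train_flops(n_teacher, d_teacher) / n_students
    return student_training + teacher_logits + teacher_training


# Illustrative sizes only: a 1B-parameter student on 20B tokens, a 3B teacher on 60B tokens.
existing = distillation_flops_per_student(1e9, 20e9, 3e9, 60e9, teacher_exists=True)
new_one = distillation_flops_per_student(1e9, 20e9, 3e9, 60e9, teacher_exists=False)
new_ten = distillation_flops_per_student(1e9, 20e9, 3e9, 60e9, teacher_exists=False, n_students=10)
print(f"existing teacher:          {existing:.3e} FLOPs")
print(f"new teacher, one student:  {new_one:.3e} FLOPs")
print(f"new teacher, ten students: {new_ten:.3e} FLOPs")
```

Under these assumptions, a single student that must also pay for a new teacher carries several times the cost of one that reuses an existing teacher, and sharing the teacher across many students closes most of that gap. This is consistent with the abstract's distinction between the two scenarios, though the actual crossover points depend on the fitted scaling law.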
Taxonomy
Topics: Process Optimization and Integration
