Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation
Cody Blakeney, Jessica Zosa Forde, Jonathan Frankle, Ziliang Zong, Matthew L. Leavitt

TL;DR
This paper shows that knowledge distillation can accelerate the training of deep neural networks such as ResNet-50 and BERT, with speed-ups of up to 1.96x and 1.42x respectively, and that simple changes to how distillation is applied improve efficiency further.
Contribution
The study systematically investigates how distillation improves training efficiency and introduces practical methods like early-stage distillation and single-teacher sampling to maximize benefits.
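To make the two techniques named above concrete, here is a minimal sketch in plain Python of a per-example loss that applies distillation only during an early fraction of training and, while distilling, samples one teacher from a pool on each step. This is an illustration under stated assumptions, not the authors' implementation: the loss form (a weighted sum of label cross-entropy and cross-entropy against the teacher's softened outputs) and the names `distill_fraction`, `alpha`, and `temperature` are hypothetical choices, since this summary does not give the paper's exact formulation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Cross-entropy between a target distribution and the student's prediction."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)


def training_loss(step, total_steps, student_logits, label, teacher_logits_pool,
                  distill_fraction=0.5, alpha=0.5, temperature=1.0, rng=random):
    """Loss for one example at one training step.

    distill_fraction, alpha, and temperature are hypothetical hyperparameters
    chosen for illustration; they are not values taken from the paper.
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, student_logits)

    # Early-stage distillation: past the cutoff, train on the labels alone,
    # so no teacher forward pass is needed for the rest of training.
    if step >= distill_fraction * total_steps:
        return hard_loss

    # Single-teacher sampling: use one randomly chosen teacher per step
    # instead of evaluating every teacher in the pool.
    teacher_logits = rng.choice(teacher_logits_pool)
    soft_targets = softmax(teacher_logits, temperature)
    soft_loss = cross_entropy(soft_targets, student_logits, temperature)
    return (1 - alpha) * hard_loss + alpha * soft_loss
```

In this sketch, both techniques reduce the teacher's cost per step: the cutoff removes teacher forward passes entirely for the later part of training, and sampling keeps the cost at one teacher pass regardless of pool size.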
Findings
Distillation speeds up ResNet-50 training by up to 1.96x.
Distillation speeds up BERT training by up to 1.42x on GLUE.
Distilling only during the first 20-50% of training yields the best results for BERT.
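As a rough reading aid for the speed-up figures above, the snippet below converts each one into the fraction of baseline training time needed. It assumes a speed-up of s means reaching the same model quality in 1/s of the baseline time; that interpretation and the helper name `fraction_of_baseline_time` are assumptions made for illustration.

```python
def fraction_of_baseline_time(speedup):
    """Fraction of baseline training time needed, assuming time scales as 1/speedup."""
    return 1.0 / speedup


# Speed-up figures as reported in the findings above.
reported = [("ResNet-50 on ImageNet", 1.96), ("BERT on GLUE", 1.42)]

for name, speedup in reported:
    frac = fraction_of_baseline_time(speedup)
    print(f"{name}: {speedup:.2f}x -> {frac:.0%} of baseline time, {1 - frac:.0%} saved")
```

Under that assumption, 1.96x corresponds to roughly 51% of the baseline training time and 1.42x to roughly 70%.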
Abstract
Methods for improving the efficiency of deep network training (i.e. the resources required to achieve a given level of model quality) are of immediate benefit to deep learning practitioners. Distillation is typically used to compress models or improve model quality, but it's unclear if distillation actually improves training efficiency. Can the quality improvements of distillation be converted into training speed-ups, or do they simply increase final model quality with no resource savings? We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective and evaluated on GLUE, using common enterprise hardware (8x NVIDIA A100). We found that distillation can speed up training by up to 1.96x in ResNet-50 trained on ImageNet and up to 1.42x…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Attention Dropout · Dense Connections · WordPiece · Linear Warmup With Linear Decay
