Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
Baiang Li, Wenhao Chai, Felix Heide

TL;DR
This paper introduces a knowledge distillation method that accelerates visual learning by freezing a weaker teacher and applying distillation only in early training. Students reach target thresholds up to 4.8x sooner in epochs on classification, and the gains carry over to object detection and diffusion generation.
Contribution
It proposes a universal, plug-and-play distillation strategy that speeds up the training of strong models on visual tasks: distillation from a frozen, weaker teacher is applied early and turned off once the student surpasses the teacher's performance.
Findings
Reaches target thresholds up to 4.8x sooner, measured in epochs, on ImageNet and CIFAR classification.
Generalizes to object detection (1.7x epoch speedup on COCO) and diffusion generation (target FID crossed 2.5x earlier, measured in steps, on CIFAR-10).
Validates the recipe as a universal speedup method across visual learning tasks.
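The speedups above compare how early two training runs cross the same target threshold. A minimal sketch of one way such a ratio could be computed is shown here. The function names, the "first crossing" definition, and the toy curves are assumptions made for illustration; they are not taken from the paper and are not its reported data.

```python
def first_crossing(curve, threshold, higher_is_better=True):
    """Return the 1-based index (epoch or step) where the metric first meets the threshold."""
    for index, value in enumerate(curve, start=1):
        crossed = value >= threshold if higher_is_better else value <= threshold
        if crossed:
            return index
    return None  # the run never reached the threshold


def crossing_speedup(baseline_curve, accelerated_curve, threshold, higher_is_better=True):
    """Ratio of baseline crossing time to accelerated crossing time (hypothetical definition)."""
    base = first_crossing(baseline_curve, threshold, higher_is_better)
    fast = first_crossing(accelerated_curve, threshold, higher_is_better)
    if base is None or fast is None:
        return None  # speedup is undefined if either run misses the target
    return base / fast


# Made-up accuracy curves, one value per epoch (illustrative only).
toy_baseline = [0.20, 0.40, 0.60, 0.70, 0.75, 0.80]
toy_accelerated = [0.50, 0.70, 0.80]
toy_speedup = crossing_speedup(toy_baseline, toy_accelerated, threshold=0.80)
```

For a metric where lower is better, such as FID, the same helper would be called with `higher_is_better=False` and a curve indexed by training step instead of epoch.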
Abstract
Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead investigate distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns it off once the student reaches and surpasses teacher-level performance. For ImageNet and CIFAR classification, this strategy reaches target thresholds much earlier, with up to 4.8 times speedup measured by epochs. We confirm that the method generalizes to other tasks and report 1.7 times epoch speedup for object detection on the COCO dataset, and 2.5 times earlier target-FID crossing for diffusion generation on the CIFAR-10 dataset, measured in steps. These…
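The recipe described in the abstract (frozen weaker teacher, distillation only early, switched off once the student surpasses the teacher) can be sketched for a classification loss as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the KL-based distillation term, the weight `alpha`, the strict "greater than" switch-off rule, and all names (`DistillationGate`, `training_loss`) are hypothetical choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Standard task loss against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def kl_to_teacher(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student); the teacher is frozen, so its logits are constants."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


class DistillationGate:
    """Keeps distillation on until the student beats the teacher, then off for good."""

    def __init__(self, teacher_metric):
        self.teacher_metric = teacher_metric  # e.g. the teacher's validation accuracy
        self.active = True

    def update(self, student_metric):
        # Assumption: switch off on the first evaluation strictly above the teacher.
        if self.active and student_metric > self.teacher_metric:
            self.active = False
        return self.active


def training_loss(student_logits, teacher_logits, label, gate, alpha=0.5):
    """Task loss plus a distillation term that is present only while the gate is active."""
    task = cross_entropy(student_logits, label)
    if not gate.active:
        return task
    return task + alpha * kl_to_teacher(student_logits, teacher_logits)
```

In a training loop, `gate.update(...)` would be called after each evaluation pass. Once the gate closes, the teacher forward pass can be skipped entirely, so the remaining training costs the same as a run without distillation.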
