VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale
Zhiwei Hao, Jianyuan Guo, Kai Han, Han Hu, Chang Xu, Yunhe Wang

TL;DR
This paper revisits vanilla knowledge distillation, demonstrating its effectiveness on large-scale datasets like ImageNet-1K when combined with strong data augmentation and training strategies, achieving state-of-the-art results.
Contribution
It reveals the small data pitfall in previous KD methods and shows vanilla KD's potential in large-scale scenarios with simple techniques.
Findings
Vanilla KD performs strongly on large datasets like ImageNet-1K.
Stronger data augmentation reduces the gap between vanilla and advanced KD methods.
Vanilla KD achieves state-of-the-art accuracy on multiple models.
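For readers unfamiliar with the term, "vanilla KD" usually refers to the classic logit-matching objective: the student is trained on a weighted sum of the ordinary cross-entropy with the ground-truth label and a KL divergence between temperature-softened teacher and student distributions. The following is a minimal sketch of that objective in plain Python, not the authors' implementation. The function names, the `alpha` weighting, and the default `temperature` are illustrative assumptions, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_kd_loss(student_logits, teacher_logits, label, temperature=1.0, alpha=0.5):
    """Classic KD objective for a single example (hypothetical helper).

    loss = (1 - alpha) * CE(student, label)
           + alpha * T^2 * KL(teacher_T || student_T)

    `alpha` and `temperature` are illustrative hyperparameters, assumed here
    for the sketch rather than taken from the paper.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-label term: KL divergence between softened distributions.
    p_teacher_t = softmax(teacher_logits, temperature)
    p_student_t = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher_t, p_student_t)
        if pt > 0.0
    )

    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl


if __name__ == "__main__":
    student = [1.0, 0.2, -0.5]
    teacher = [3.0, 0.5, -2.0]
    print(vanilla_kd_loss(student, teacher, label=0, temperature=2.0, alpha=0.5))
```

When the student's logits equal the teacher's, the KL term vanishes and only the cross-entropy term remains, which is a quick sanity check for any implementation of this loss.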
Abstract
The tremendous success of large models trained on extensive datasets demonstrates that scale is a key ingredient in achieving superior results. Therefore, the reflection on the rationality of designing knowledge distillation (KD) approaches for limited-capacity architectures solely based on small-scale datasets is now deemed imperative. In this paper, we identify the small data pitfall that is present in previous KD methods, which results in the underestimation of the power of the vanilla KD framework on large-scale datasets such as ImageNet-1K. Specifically, we show that employing stronger data augmentation techniques and using larger datasets can directly decrease the gap between vanilla KD and other meticulously designed KD variants. This highlights the necessity of designing and evaluating KD approaches in the context of practical scenarios, casting off the limitations of…
Taxonomy
Topics: COVID-19 diagnosis using AI · Advanced Neural Network Applications · Artificial Intelligence in Healthcare and Education
Methods: Knowledge Distillation
