Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability
Ruifei He, Shuyang Sun, Jihan Yang, Song Bai, Xiaojuan Qi

TL;DR
This paper introduces a feature-based knowledge distillation method for pre-training that needs 10x less data and 5x less pre-training time while remaining competitive with supervised pre-training on downstream tasks.
Contribution
The paper proposes KDEP, a feature-based KD approach that uses non-parametric feature alignment to transfer a pre-trained teacher's feature representation to a new student model, so the student can be pre-trained without extensive data or time.
Findings
Achieves performance comparable to supervised pre-training on multiple downstream tasks
Requires 10x less data and 5x less pre-training time than supervised pre-training
Transfers the learned features effectively to downstream applications
Abstract
Large-scale pre-training has proven crucial for various computer vision tasks. However, as the volume of pre-training data, the number of model architectures, and the amount of private or inaccessible data grow, it is neither efficient nor always possible to pre-train every model architecture on large-scale datasets. In this work, we investigate an alternative strategy for pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP), which aims to efficiently transfer the learned feature representation from existing pre-trained models to new student models for future downstream tasks. We observe that existing Knowledge Distillation (KD) methods are unsuitable for pre-training since they normally distill the logits, which are discarded when the model is transferred to downstream tasks. To resolve this problem, we propose a feature-based KD method with non-parametric feature…
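To make the idea of distilling features rather than logits concrete, here is a minimal sketch. It is not the authors' implementation: the abstract is truncated before the alignment method is described, so the SVD-based projection, the mean-squared-error loss, and the names `fit_alignment`, `align`, and `feature_kd_loss` are all assumptions chosen for illustration. The only point it demonstrates is that the teacher's wider feature vectors can be mapped to the student's width without any trainable parameters, and that the loss never touches classification logits.

```python
import numpy as np


def fit_alignment(teacher_feats, student_dim):
    """Derive a teacher-to-student projection from data alone (no trainable weights).

    Assumption for illustration: the projection is the top principal
    directions of the teacher features, obtained via SVD.
    """
    mean = teacher_feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(teacher_feats - mean, full_matrices=False)
    basis = vt[:student_dim].T  # shape: (teacher_dim, student_dim)
    return mean, basis


def align(teacher_feats, mean, basis):
    """Project teacher features down to the student's feature width."""
    return (teacher_feats - mean) @ basis


def feature_kd_loss(student_feats, target_feats):
    """Mean squared error between student features and aligned teacher features."""
    return float(np.mean((student_feats - target_feats) ** 2))


rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(256, 64))  # hypothetical teacher features, width 64
student_feats = rng.normal(size=(256, 16))  # hypothetical student features, width 16

mean, basis = fit_alignment(teacher_feats, student_dim=16)
targets = align(teacher_feats, mean, basis)
print(targets.shape, feature_kd_loss(student_feats, targets))
```

In an actual training loop, `student_feats` would come from the student backbone and the loss would be minimized by gradient descent; the random arrays here only stand in for those features.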
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
Methods: Knowledge Distillation
