Prune Your Model Before Distill It
Jinhyuk Park, Albert No

TL;DR
This paper introduces "prune, then distill," a framework that prunes the teacher model before distilling it into a student. The authors argue that the pruned teacher transfers knowledge more readily and acts as a regularizer that reduces generalization error, and they build a neural network compression scheme on this idea.
Contribution
The paper proposes a framework that prunes the teacher before distillation, shows theoretically that the pruned teacher plays the role of a regularizer, and gives exploratory examples in which pruned teachers teach better than their unpruned originals.
Findings
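In several exploratory examples, pruned teachers teach better than the original unpruned networks.
The pruned teacher acts as a regularizer in distillation, which reduces generalization error.
The result motivates a compression scheme in which the student network is formed from the pruned teacher and then trained with "prune, then distill."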
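As a rough illustration of the pruning step, the sketch below zeroes out the smallest-magnitude weights of a toy teacher. Magnitude-based unstructured pruning is an assumption here: this summary does not say which pruning criterion the paper uses, and the function name `magnitude_prune`, the weight values, and the 50% sparsity level are all hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0.

    Assumption: unstructured magnitude pruning, chosen only for illustration.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Hypothetical teacher weights, flattened into one list.
teacher_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(teacher_weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

In a real pipeline the pruned teacher would typically be fine-tuned before it is used for distillation; that step is omitted here.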
Abstract
Knowledge distillation transfers the knowledge from a cumbersome teacher to a small student. Recent results suggest that a student-friendly teacher is more appropriate to distill since it provides more transferable knowledge. In this work, we propose a novel framework, "prune, then distill," that first prunes the model to make it more transferable and then distills it to the student. We provide several exploratory examples where the pruned teacher teaches better than the original unpruned networks. We further show theoretically that the pruned teacher plays the role of a regularizer in distillation, which reduces the generalization error. Based on this result, we propose a novel neural network compression scheme in which the student network is formed based on the pruned teacher and then trained with the "prune, then distill" strategy. The code is available at…
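To make the distillation step concrete, here is a minimal sketch of a temperature-scaled distillation loss for a single example. It assumes the common formulation that mixes hard-label cross-entropy with a KL term between softened teacher and student outputs; the abstract does not specify the loss the authors use, and the names `distillation_loss`, `temperature`, and `alpha` are hypothetical. In "prune, then distill," the teacher logits would come from the pruned teacher.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """(1 - alpha) * CE(student, label) + alpha * T^2 * KL(teacher || student).

    Assumption: a standard soft-target distillation loss, not necessarily
    the exact objective used in the paper.
    """
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # KL divergence between softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


# Hypothetical logits: the teacher here stands in for a pruned teacher.
loss = distillation_loss([1.0, 0.5, 2.0], [0.8, 0.2, 2.5], label=2)
print(loss)
```

The `temperature ** 2` factor keeps the gradient scale of the soft-target term comparable across temperatures, which is the usual convention for this loss.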
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
