TL;DR
This paper introduces a generic teacher network trained once to effectively transfer knowledge to various student architectures, reducing the need for repeated customization and improving efficiency in model compression.
Contribution
The paper proposes a one-off KD-aware training method that produces a generic teacher capable of effective knowledge transfer to any student architecture drawn from a given finite pool.
Findings
Improves knowledge distillation effectiveness across diverse student models.
Reduces overall training cost by amortizing the one-off generic teacher training across the student models it serves.
Enhances flexibility in deploying compressed models on different hardware.
Abstract
Knowledge distillation (KD) is a model compression method that entails training a compact student model to emulate the performance of a more complex teacher model. However, the architectural capacity gap between the two models limits the effectiveness of knowledge transfer. Addressing this issue, previous works focused on customizing teacher-student pairs to improve compatibility, a computationally expensive process that needs to be repeated every time either model changes. Hence, these methods are impractical when a teacher model has to be compressed into different student models for deployment on multiple hardware devices with distinct resource constraints. In this work, we propose Generic Teacher Network (GTN), a one-off KD-aware training to create a generic teacher capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures.…
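For readers unfamiliar with the basic mechanism, the following is a minimal sketch of a conventional temperature-scaled distillation loss for a single example, in which a student is trained to match a teacher's softened output distribution. It is not the GTN training procedure described in the abstract; the function names, the `temperature` and `alpha` parameters, and their default values are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Generic KD loss for one example (illustrative, not the paper's method).

    Combines KL(teacher || student) on temperature-softened distributions
    with the usual cross-entropy against the ground-truth label.
    `temperature` and `alpha` are hypothetical defaults.
    """
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # Cross-entropy of the unsoftened student prediction with the hard label.
    ce = -log_softmax(student_logits)[label]
    # The T^2 factor keeps the soft-target term's gradient scale comparable
    # across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(kd_loss([1.0, 2.0, 3.0], teacher, label=2))  # student agrees with teacher
    print(kd_loss([3.0, 2.0, 1.0], teacher, label=2))  # student disagrees
```

In this sketch the loss is lower when the student's logits match the teacher's. The capacity gap mentioned in the abstract concerns cases where a small student cannot drive the soft-target term down no matter how long it trains against a much larger teacher.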
Taxonomy
Methods: ALIGN
