TL;DR
This paper introduces AMTML-KD, a novel framework for knowledge distillation that adaptively learns from multiple teachers at different levels, improving student network performance.
Contribution
It proposes a new adaptive multi-teacher multi-level distillation method that assigns importance weights to teachers and gathers intermediate hints from multiple sources.
Findings
Student models outperform strong competitors on public datasets.
Adaptive weighting improves the relevance of teacher knowledge.
Multi-level distillation enhances learning effectiveness.
Abstract
Knowledge distillation (KD) is an effective learning paradigm for improving the performance of lightweight student networks by utilizing additional supervision knowledge distilled from teacher networks. Most pioneering studies either learn from only a single teacher, neglecting the potential for a student to learn from multiple teachers simultaneously, or simply treat every teacher as equally important, and so cannot reveal how much each teacher matters for a specific example. To bridge this gap, we propose a novel adaptive multi-teacher multi-level knowledge distillation learning framework (AMTML-KD), which consists of two novel insights: (i) associating each teacher with a latent representation to adaptively learn instance-level teacher importance weights, which are leveraged for acquiring integrated soft-targets (high-level knowledge) and (ii)…
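Insight (i) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dot-product scoring between a per-example feature vector and a per-teacher latent vector, the function names `integrate_soft_targets` and `kd_loss`, and the temperature value are all assumptions made for illustration. The sketch only shows the general idea of turning per-example teacher scores into importance weights and using them to mix the teachers' softened predictions into one soft target.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def integrate_soft_targets(teacher_logits, example_features, teacher_latents,
                           temperature=4.0):
    """Mix teacher predictions with per-example importance weights.

    teacher_logits:   (T, B, C) logits from T teachers for B examples, C classes
    example_features: (B, D) per-example representation (hypothetical input)
    teacher_latents:  (T, D) one latent vector per teacher (learned in practice)
    Returns (weights, targets) with shapes (B, T) and (B, C).
    """
    # Assumed scoring rule: dot product between example and teacher latent.
    scores = example_features @ teacher_latents.T            # (B, T)
    weights = softmax(scores, axis=1)                        # rows sum to 1
    # Soften each teacher's prediction with the distillation temperature.
    teacher_probs = softmax(teacher_logits / temperature, axis=-1)  # (T, B, C)
    # Weighted average over teachers, separately for every example.
    targets = np.einsum("bt,tbc->bc", weights, teacher_probs)
    return weights, targets


def kd_loss(student_logits, targets, temperature=4.0):
    # Cross-entropy between the integrated soft targets and the
    # student's softened prediction, averaged over the batch.
    log_p = np.log(softmax(student_logits / temperature, axis=-1) + 1e-12)
    return float(-(targets * log_p).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, B, C, D = 3, 5, 10, 8
    weights, targets = integrate_soft_targets(
        rng.normal(size=(T, B, C)),
        rng.normal(size=(B, D)),
        rng.normal(size=(T, D)),
    )
    print(weights.shape, targets.shape)
    print(kd_loss(rng.normal(size=(B, C)), targets))
```

Because the weights are computed per example, a teacher that dominates the target for one input can contribute little for another, which is the contrast with equal-weight averaging that the abstract draws. In a trainable version the latent vectors would be parameters updated jointly with the student; here they are random placeholders.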
Taxonomy
Methods: Knowledge Distillation
