Integrating Knowledge Distillation Methods: A Sequential Multi-Stage Framework
Yinxi Tian, Changwu Huang, Ke Tang, and Xin Yao

TL;DR
This paper introduces SMSKD, a flexible multi-stage framework that applies different knowledge distillation methods in sequence to improve student model accuracy, while mitigating forgetting and supporting arbitrary method combinations at negligible overhead.
Contribution
The paper proposes SMSKD, a novel sequential multi-stage distillation framework that integrates heterogeneous KD methods with adaptive weighting and reference models to enhance performance.
Findings
Consistently improves student accuracy across architectures.
Supports arbitrary method combinations with negligible overhead.
Stage-wise distillation and adaptive weighting significantly boost results.
Abstract
Knowledge distillation (KD) transfers knowledge from large teacher models to compact student models, enabling efficient deployment on resource-constrained devices. Diverse KD methods, including response-based, feature-based, and relation-based approaches, capture different aspects of teacher knowledge. Integrating multiple methods or knowledge sources is promising, but it is often hampered by complex implementation, inflexible combinations, and catastrophic forgetting, which limit practical effectiveness. This work proposes SMSKD (Sequential Multi-Stage Knowledge Distillation), a flexible framework that sequentially integrates heterogeneous KD methods. At each stage, the student is trained with a specific distillation method, while a frozen reference model from the previous stage anchors learned knowledge to mitigate forgetting. In addition, we introduce an adaptive weighting…
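The stage-wise idea in the abstract (train the student with one distillation method per stage while a frozen copy from the previous stage anchors what was already learned) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the temperature-softened KL term used as a stand-in for the stage's distillation method, and the fixed `anchor_weight` are all assumptions made for illustration. The paper's adaptive weighting is not specified in the text above, so it is not modeled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def stage_loss(student_logits, teacher_logits, reference_logits=None,
               anchor_weight=0.5, temperature=4.0):
    """Hypothetical per-sample loss for one stage.

    distill: stand-in for the stage's distillation method (here a
             response-based KL term between teacher and student).
    anchor:  KL term pulling the student toward the frozen reference
             model saved at the end of the previous stage.
    anchor_weight is a fixed constant in this sketch (an assumption).
    """
    student = softmax(student_logits, temperature)
    teacher = softmax(teacher_logits, temperature)
    distill = kl_divergence(teacher, student)
    if reference_logits is None:  # first stage: no previous student exists
        return distill
    reference = softmax(reference_logits, temperature)
    anchor = kl_divergence(reference, student)
    return distill + anchor_weight * anchor


# Example: the anchor term adds a penalty once the student drifts from the
# reference, on top of the stage's own distillation loss.
student = [0.2, 1.5, -0.3]
teacher = [0.1, 2.0, -1.0]
reference = [0.5, 1.0, 0.0]
first_stage = stage_loss(student, teacher)
later_stage = stage_loss(student, teacher, reference)
```

In a full training loop, each stage would swap in its own distillation term (feature-based or relation-based, for example), and the student's weights at the end of the stage would be copied and frozen to serve as the next stage's reference.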
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Domain Adaptation and Few-Shot Learning
