MoKD: Multi-Task Optimization for Knowledge Distillation
Zeeshan Hayder, Ali Cheraghian, Lars Petersson, Mehrtash Harandi

TL;DR
MoKD introduces a multi-task optimization framework for knowledge distillation that addresses gradient conflicts and dominance, leading to more effective and efficient training of compact models with state-of-the-art results.
Contribution
It reformulates knowledge distillation as a multi-objective optimization problem and employs a subspace learning framework to enhance knowledge transfer.
Findings
Outperforms existing KD methods on ImageNet-1K and COCO datasets.
Achieves state-of-the-art performance with greater efficiency.
Models trained with MoKD outperform models trained from scratch.
Abstract
Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in KD are: 1) balancing learning from the teacher's guidance against the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective's gradient dominates, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling better balance between objectives. Additionally, it introduces a subspace learning framework to project feature representations into a…
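The two gradient issues named in the abstract can be made concrete with a small sketch. The code below is not the MoKD algorithm, whose details are not given in this summary. It assumes one common reading: a conflict is a negative cosine similarity between the task gradient and the distillation gradient, and dominance is a large ratio between their norms. The helper names (`diagnose`, `min_norm_direction`) and the `dominance_ratio` threshold are hypothetical. The combination step uses the standard two-objective min-norm rule from multi-objective optimization as a stand-in for whatever balancing scheme the paper actually uses.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))


def diagnose(g_task, g_kd, dominance_ratio=10.0):
    """Report conflict and dominance between two flattened gradients.

    dominance_ratio is an illustrative threshold, not a value from the paper.
    """
    n_task, n_kd = norm(g_task), norm(g_kd)
    if n_task == 0.0 or n_kd == 0.0:
        cos = 0.0                  # direction undefined for a zero gradient
        ratio = math.inf           # one objective contributes nothing
    else:
        cos = dot(g_task, g_kd) / (n_task * n_kd)
        ratio = max(n_task, n_kd) / min(n_task, n_kd)
    return {
        "cosine": cos,
        "conflict": cos < 0.0,               # gradients point in opposing directions
        "norm_ratio": ratio,
        "dominance": ratio > dominance_ratio,
    }


def min_norm_direction(g1, g2):
    """Min-norm point in the convex hull of two gradients.

    Minimizes ||a*g1 + (1-a)*g2||^2 over a in [0, 1] in closed form.
    The result has a non-negative inner product with both g1 and g2.
    """
    diff = [b - a for a, b in zip(g1, g2)]   # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:                          # identical gradients
        return list(g1), 0.5
    alpha = min(1.0, max(0.0, dot(diff, g2) / denom))
    combined = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return combined, alpha


# Toy 2-D gradients chosen so that they conflict.
g_task = [1.0, 0.0]
g_kd = [-0.5, 1.0]
report = diagnose(g_task, g_kd)
direction, alpha = min_norm_direction(g_task, g_kd)
print(report["conflict"], report["dominance"], round(alpha, 3))
```

In a real training loop the two vectors would be the flattened parameter gradients of the task loss and the distillation loss, and the combined direction would replace the gradient of a fixed weighted sum.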
Taxonomy
Methods: Knowledge Distillation
