MoKD: Multi-Task Optimization for Knowledge Distillation

Zeeshan Hayder; Ali Cheraghian; Lars Petersson; Mehrtash Harandi

arXiv:2505.08170·cs.CV·August 5, 2025

TL;DR

MoKD introduces a multi-task optimization framework for knowledge distillation that addresses gradient conflicts and dominance, leading to more effective and efficient training of compact models with state-of-the-art results.

Contribution

MoKD reformulates knowledge distillation as a multi-objective optimization problem and employs a subspace learning framework to enhance knowledge transfer.

Findings

01. Outperforms existing KD methods on ImageNet-1K and COCO datasets.

02. Achieves state-of-the-art performance with greater efficiency.

03. Models trained with MoKD outperform models trained from scratch.

Abstract

Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in KD are: 1) balancing learning from the teacher's guidance and the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective's gradient dominates, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling better balance between objectives. Additionally, it introduces a subspace learning framework to project feature representations into a…
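The two gradient issues named in the abstract can be made concrete with a small diagnostic. The sketch below is not the authors' algorithm: it assumes the task and distillation gradients have already been flattened into plain Python lists, flags a conflict when their cosine similarity is negative, and flags dominance when one norm exceeds the other by a threshold (`dominance_ratio`, an illustrative value, not one from the paper). For the combination step it swaps in the standard closed-form min-norm weighting for two objectives from the general multi-objective optimization literature, used here only as a stand-in. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a gradient vector."""
    return math.sqrt(dot(u, u))


def diagnose(g_task, g_kd, dominance_ratio=10.0):
    """Report conflict (negative cosine) and dominance (large norm ratio).

    dominance_ratio is an assumed, illustrative threshold.
    """
    n_task, n_kd = norm(g_task), norm(g_kd)
    if n_task == 0.0 or n_kd == 0.0:
        # Cosine is undefined for a zero gradient; treat it as no conflict.
        cosine = 0.0
        ratio = math.inf if max(n_task, n_kd) > 0.0 else 1.0
    else:
        cosine = dot(g_task, g_kd) / (n_task * n_kd)
        ratio = max(n_task, n_kd) / min(n_task, n_kd)
    return {
        "cosine": cosine,
        "conflict": cosine < 0.0,
        "norm_ratio": ratio,
        "dominance": ratio > dominance_ratio,
    }


def min_norm_weight(g_task, g_kd):
    """Weight alpha in [0, 1] minimizing ||alpha*g_task + (1-alpha)*g_kd||^2.

    Generic two-objective min-norm solution, not MoKD's own update rule.
    """
    diff = [a - b for a, b in zip(g_task, g_kd)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same result
    alpha = -dot(diff, g_kd) / denom
    return min(1.0, max(0.0, alpha))


def combine(g_task, g_kd):
    """Blend the two gradients using the min-norm weight."""
    alpha = min_norm_weight(g_task, g_kd)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g_task, g_kd)]


if __name__ == "__main__":
    # Toy gradients chosen by hand; they are not data from the paper.
    g_task = [1.0, 0.0]
    g_kd = [-1.0, 1.0]
    print(diagnose(g_task, g_kd))
    print(combine(g_task, g_kd))
```

On the toy inputs, the two gradients point in partly opposite directions, so naively summing them would let one objective undo progress on the other. The min-norm blend yields a direction with a positive inner product against both gradients, which is the kind of balance a multi-objective treatment of KD aims for.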

Taxonomy

Methods: Knowledge Distillation