Knowledge Distillation via Route Constrained Optimization
Xiao Jin, Baoyun Peng, Yichao Wu, Yu Liu, Jiaheng Liu, Ding Liang, Junjie Yan, Xiaolin Hu

TL;DR
This paper introduces route constrained optimization (RCO), a novel knowledge distillation method inspired by curriculum learning, which improves the training of small neural networks by using selected route points from the teacher model's parameter space.
Contribution
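The paper proposes RCO, a new approach that aims to reduce the congruence loss in knowledge distillation by supervising the student with anchor points taken from the route the teacher followed through parameter space, rather than with the converged teacher alone. It reports improved performance on classification and face recognition tasks.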
Findings
RCO improves accuracy on CIFAR100 by 2.14%.
RCO enhances ImageNet performance by 1.5%.
RCO demonstrates better generalization on MegaFace face recognition.
Abstract
Distillation-based learning boosts the performance of a miniaturized neural network based on the hypothesis that the representation of a teacher model can be used as structured and relatively weak supervision, and thus would be easily learned by a miniaturized model. However, we find that the representation of a converged heavy model is still a strong constraint for training a small student model, which leads to a high lower bound of congruence loss. In this work, we consider knowledge distillation from the perspective of curriculum learning by routing. Instead of supervising the student model with a converged teacher model, we supervise it with anchor points selected from the route in parameter space that the teacher model passed through, which we call route constrained optimization (RCO). We experimentally demonstrate this simple operation greatly…
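The routing idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: anchors are picked at equal spacing along the saved teacher checkpoints, the congruence loss is a KL divergence between softmax outputs, and both teacher and student are plain linear classifiers. The names `select_anchor_indices`, `kd_loss`, and `rco_distill` are hypothetical.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def kd_loss(student_logits, teacher_logits):
    # Assumed congruence loss: mean KL(teacher || student) over the batch.
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=1).mean())

def select_anchor_indices(num_checkpoints, num_anchors):
    # Assumed strategy: equally spaced anchors, always ending at the
    # final (converged) teacher checkpoint.
    if not 1 <= num_anchors <= num_checkpoints:
        raise ValueError("need 1 <= num_anchors <= num_checkpoints")
    return [(i * num_checkpoints) // num_anchors - 1
            for i in range(1, num_anchors + 1)]

def rco_distill(x, teacher_checkpoints, student_w, num_anchors,
                steps_per_anchor=200, lr=0.5):
    # Train the student against each anchor in turn, from the earliest
    # teacher checkpoint to the converged one.
    for idx in select_anchor_indices(len(teacher_checkpoints), num_anchors):
        p = np.exp(log_softmax(x @ teacher_checkpoints[idx]))
        for _ in range(steps_per_anchor):
            q = np.exp(log_softmax(x @ student_w))
            # Gradient of the mean KL loss w.r.t. linear student weights.
            student_w = student_w - lr * x.T @ (q - p) / len(x)
    return student_w

# Toy data: a scaled copy of the final weights stands in for the
# checkpoints saved during teacher training.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
final_teacher_w = rng.normal(size=(5, 3))
checkpoints = [final_teacher_w * (k + 1) / 10 for k in range(10)]

student_w = rco_distill(x, checkpoints, np.zeros((5, 3)), num_anchors=4)
print(kd_loss(x @ student_w, x @ final_teacher_w))
```

In this toy setup the student can match the teacher exactly, so the sketch only shows the mechanics of stepping through anchors; it does not reproduce the capacity gap between a heavy teacher and a small student that motivates RCO.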
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
Methods: Knowledge Distillation
