DOT: A Distillation-Oriented Trainer
Borui Zhao, Quan Cui, Renjie Song, Jiajun Liang

TL;DR
The paper introduces DOT, a new training method for knowledge distillation that separately optimizes task and distillation losses, leading to better convergence and improved model accuracy.
Contribution
DOT keeps the gradients of the task loss and the distillation loss separate and applies a larger momentum to the distillation-loss gradient, so that both losses are sufficiently optimized and the trade-off between them is broken.
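The gradient separation and momentum adjustment can be illustrated with a minimal sketch of one SGD-with-momentum step. The page only says that the distillation loss gets a larger momentum. The function name `dot_sgd_step`, the symmetric `momentum - delta` / `momentum + delta` split, and the default values below are assumptions made for illustration, not the authors' implementation.

```python
def dot_sgd_step(params, grad_task, grad_distill, buf_task, buf_distill,
                 lr=0.1, momentum=0.9, delta=0.075):
    """One momentum-SGD step with a separate buffer per loss (hypothetical sketch).

    params, grad_*, buf_* are equal-length lists of floats.
    Returns the updated (params, buf_task, buf_distill).
    """
    # Assumption: shift the base momentum down for the task loss and up for
    # the distillation loss, so the distillation gradient is remembered longer.
    mu_task = momentum - delta
    mu_distill = momentum + delta

    new_params, new_buf_task, new_buf_distill = [], [], []
    for p, g_t, g_d, v_t, v_d in zip(params, grad_task, grad_distill,
                                     buf_task, buf_distill):
        v_t = mu_task * v_t + g_t        # momentum buffer for the task gradient
        v_d = mu_distill * v_d + g_d     # momentum buffer for the distillation gradient
        new_params.append(p - lr * (v_t + v_d))  # apply both buffers in one update
        new_buf_task.append(v_t)
        new_buf_distill.append(v_d)
    return new_params, new_buf_task, new_buf_distill
```

With `delta=0` the two buffers add up to the single buffer of ordinary momentum SGD on the summed gradient, so under these assumptions the sketch only departs from the usual trainer through `delta`.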
Findings
DOT achieves a +2.59% accuracy gain on ImageNet-1k for the ResNet50-MobileNetV1 teacher-student pair.
It improves loss convergence and model generalization.
Extensive experiments validate the effectiveness of DOT.
Abstract
Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between task and distillation losses, i.e., introducing distillation loss limits the convergence of task loss. We believe that the trade-off results from the insufficient optimization of distillation loss. The reasoning is as follows: the teacher has a lower task loss than the student, and a lower distillation loss makes the student more similar to the teacher, so a better-converged task loss can be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT separately considers gradients of task and distillation losses, then applies a larger momentum to distillation loss to accelerate its optimization. We empirically show that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive…
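As a rough illustration of the two losses the abstract refers to, the sketch below computes a task loss and a distillation loss for a single example. The abstract does not say which distillation loss is used. The temperature-scaled KL divergence here is the classic logit-based formulation, chosen as an assumption, and the function names and the default `temperature` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def task_loss(student_logits, label):
    """Cross-entropy between the student's prediction and the true label."""
    return -math.log(softmax(student_logits)[label])

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs (assumed form)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

The distillation loss is zero when student and teacher logits match, which is the sense in which a lower distillation loss means a student more similar to the teacher.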
Taxonomy
Topics: Brain Tumor Detection and Classification · COVID-19 diagnosis using AI · Advanced Neural Network Applications
