Do Not Blindly Imitate the Teacher: Using Perturbed Loss for Knowledge Distillation
Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Jialu Liu, Michael Bendersky, Marc Najork, Chao Zhang

TL;DR
This paper introduces PTLoss, a novel knowledge distillation method that perturbs the traditional loss to produce a proxy teacher closer to ground truth, improving student model performance.
Contribution
The paper proposes PTLoss, a new distillation objective that perturbs the KL loss to better align with ground truth, backed by theoretical analysis and extensive experiments.
Findings
PTLoss improves distillation performance across multiple datasets.
The paper establishes a theoretical link between distribution closeness and model generalizability.
PTLoss yields significant gains over traditional KL-based distillation methods.
Abstract
Knowledge distillation is a popular technique to transfer knowledge from large teacher models to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence of its output distribution with the teacher's output distribution. In this work, we argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution. Therefore, forcing the student to blindly imitate the unreliable teacher output distribution leads to inferior performance. To this end, we propose a novel knowledge distillation objective PTLoss by first representing the vanilla KL-based distillation loss function via a Maclaurin series and then perturbing the leading-order terms in this series. This perturbed loss implicitly transforms the original teacher into a proxy teacher with a…
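The sketch below illustrates the idea described in the abstract under stated assumptions; it is not the authors' implementation. It assumes the Maclaurin series in question is the standard expansion -log(p) = Σ_{m≥1} (1 − p)^m / m applied to the student probabilities inside the KL term, and that "perturbing the leading-order terms" means adding a small offset to the first few coefficients. The function names and the `eps` parameter are hypothetical, and the exact form used in the paper may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distill_loss(p_teacher, p_student):
    """Vanilla distillation loss: KL(teacher || student)."""
    return sum(t * (math.log(t) - math.log(s))
               for t, s in zip(p_teacher, p_student) if t > 0)

def neg_log_series(p, n_terms=200):
    """Truncated series -log(p) = sum_{m>=1} (1 - p)^m / m, valid for 0 < p <= 1.
    Convergence is slow when p is close to 0."""
    return sum((1.0 - p) ** m / m for m in range(1, n_terms + 1))

def perturbed_loss(p_teacher, p_student, eps):
    """Assumed form: the m-th series coefficient 1/m becomes 1/m + eps[m-1]
    for the leading len(eps) terms. This equals the KL loss plus a correction."""
    extra = sum(
        t * sum(e * (1.0 - s) ** m for m, e in enumerate(eps, start=1))
        for t, s in zip(p_teacher, p_student)
    )
    return kl_distill_loss(p_teacher, p_student) + extra

# Illustrative inputs (made-up logits, not data from the paper).
p_t = softmax([2.0, 1.0, 0.1])   # teacher distribution
p_s = softmax([1.5, 1.2, 0.3])   # student distribution
vanilla = kl_distill_loss(p_t, p_s)
unperturbed = perturbed_loss(p_t, p_s, eps=[0.0, 0.0])   # reduces to the vanilla loss
perturbed = perturbed_loss(p_t, p_s, eps=[0.5, -0.2])    # hypothetical perturbation values
```

With all perturbation coefficients set to zero, the perturbed loss reduces to the vanilla KL loss, which is why the perturbation can be treated as a set of extra hyperparameters on top of the standard objective.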
Taxonomy
Topics: Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
Methods: Knowledge Distillation
