Do Not Blindly Imitate the Teacher: Using Perturbed Loss for Knowledge Distillation

Rongzhi Zhang; Jiaming Shen; Tianqi Liu; Jialu Liu; Michael Bendersky; Marc Najork; Chao Zhang

arXiv:2305.05010·cs.LG·May 10, 2023·5 cites

Open Access

TL;DR

This paper introduces PTLoss, a knowledge distillation objective that perturbs the standard KL-based distillation loss. The perturbation implicitly yields a proxy teacher that is closer to the ground truth, which improves student model performance.

Contribution

The paper proposes PTLoss, a new distillation objective that perturbs the KL-based loss so that the student's target aligns better with the ground truth, backed by theoretical analysis and extensive experiments.

Findings

01. PTLoss improves distillation performance across multiple datasets.

02. The paper draws a theoretical link between distribution closeness and model generalizability.

03. PTLoss yields significant gains over traditional KL-based distillation methods.

Abstract

Knowledge distillation is a popular technique to transfer knowledge from large teacher models to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence of its output distribution with the teacher's output distribution. In this work, we argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution. Therefore, forcing the student to blindly imitate the unreliable teacher output distribution leads to inferior performance. To this end, we propose a novel knowledge distillation objective PTLoss by first representing the vanilla KL-based distillation loss function via a Maclaurin series and then perturbing the leading-order terms in this series. This perturbed loss implicitly transforms the original teacher into a proxy teacher with a…
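The construction described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the Maclaurin expansion in question is the standard series -log(p) = sum over m >= 1 of (1 - p)^m / m, and that "perturbing the leading-order terms" means adding a small coefficient to the 1/m weight of the first few terms. The function names and the `eps` coefficient table are hypothetical and chosen only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_probs, student_probs):
    """Vanilla distillation loss: KL(teacher || student)."""
    return sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def neg_log_maclaurin(p, num_terms):
    """Truncated series for -log(p), valid for 0 < p <= 1."""
    return sum((1 - p) ** m / m for m in range(1, num_terms + 1))


def perturbed_loss(teacher_probs, student_probs, eps):
    """KL loss with perturbed leading-order series terms.

    eps[c][m - 1] is a hypothetical perturbation added to the 1/m
    weight of the order-m term for class c (an assumption of this
    sketch). With all coefficients at zero, this is plain KL.
    """
    loss = kl_distillation_loss(teacher_probs, student_probs)
    for c, (t, s) in enumerate(zip(teacher_probs, student_probs)):
        for m, coeff in enumerate(eps[c], start=1):
            loss += t * coeff * (1 - s) ** m
    return loss


teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 1.2, 0.3])

zero_eps = [[0.0, 0.0] for _ in teacher]   # no perturbation
small_eps = [[0.1, 0.0] for _ in teacher]  # perturb only the first-order term

kl = kl_distillation_loss(teacher, student)
unperturbed = perturbed_loss(teacher, student, zero_eps)
perturbed = perturbed_loss(teacher, student, small_eps)
```

Because the perturbation touches only a finite number of leading terms, the sketch adds a correction on top of the closed-form KL value instead of summing the infinite series. How the coefficients are chosen is not covered by the text above and is left open here.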

Taxonomy

Topics: Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification

Methods: Knowledge Distillation