Knowledge Distillation Based on Transformed Teacher Matching

Kaixiang Zheng; En-Hui Yang

arXiv:2402.11148·cs.LG·March 11, 2024·5 cites

TL;DR

This paper introduces transformed teacher matching (TTM), a knowledge distillation variant that omits temperature scaling on the student side. The resulting objective carries an inherent regularization term that improves the student's generalization. A weighted extension, WTTM, further improves on TTM and is reported to reach state-of-the-art results.

Contribution

The paper proposes TTM, a KD variant that removes student-side temperature scaling and whose objective therefore contains an inherent Rényi entropy regularization term, along with WTTM, an adaptive weighting scheme for better performance.

Findings

01 TTM outperforms traditional KD in generalization.

02 WTTM achieves state-of-the-art accuracy.

03 Inherent regularization improves student model performance.

Abstract

As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both the teacher's logits and the student's logits in KD. Motivated by some recent works, in this paper we instead drop temperature scaling on the student side and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of the probability distribution, we show that, in comparison with the original KD, TTM has an inherent Rényi entropy term in its objective function, which serves as an extra regularization term. Extensive experimental results demonstrate that, thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance student's…
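The two ideas in the abstract, temperature scaling as a power transform and dropping the student-side temperature, can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names and logit values below are hypothetical, only the distillation (KL) term is shown, and the cross-entropy term, the β weight, the usual T² factor, and WTTM's sample weighting are all omitted. The Rényi entropy relation checked at the end is an algebraic identity derived here for illustration, not a formula quoted from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def power_transform(probs, gamma):
    """Raise each probability to the power gamma, then renormalize."""
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def kl(p, q):
    """KL divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi_entropy(p, gamma):
    """Renyi entropy of order gamma (gamma != 1)."""
    return math.log(sum(pi ** gamma for pi in p)) / (1.0 - gamma)

def kd_distill_term(teacher_logits, student_logits, T):
    """Conventional KD: temperature applied to teacher AND student."""
    return kl(softmax(teacher_logits, T), softmax(student_logits, T))

def ttm_distill_term(teacher_logits, student_logits, T):
    """TTM-style term: temperature on the teacher only (assumed form)."""
    return kl(softmax(teacher_logits, T), softmax(student_logits, 1.0))

# Hypothetical logits for a 4-class problem.
teacher_logits = [4.0, 1.5, 0.2, -1.0]
student_logits = [2.5, 2.0, 0.1, -0.5]
T = 4.0
gamma = 1.0 / T

# Temperature scaling equals a power transform with exponent 1/T.
p_scaled = softmax(teacher_logits, T)
p_powered = power_transform(softmax(teacher_logits), gamma)

q = softmax(student_logits)
kd = kd_distill_term(teacher_logits, student_logits, T)
ttm = ttm_distill_term(teacher_logits, student_logits, T)

# Identity: the KD term equals a scaled TTM term plus a Renyi entropy
# term in the student output q (minus a teacher-only constant).
kd_from_ttm = gamma * ttm + (1.0 - gamma) * (
    renyi_entropy(q, gamma) - shannon_entropy(p_scaled)
)

print("max |scaled - powered|:", max(abs(a - b) for a, b in zip(p_scaled, p_powered)))
print("KD term:", kd, "TTM term:", ttm)
print("KD term rebuilt from TTM + Renyi entropy:", kd_from_ttm)
```

With T = 1 the power transform is the identity, so the two distillation terms in this sketch coincide; the difference between them appears only for T ≠ 1.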

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

Overall, I think the paper provides new findings that help explain the role of temperature in knowledge distillation, and the evaluation experiments are extensive.
1. The theoretical derivation and analysis of general KD, Rényi entropy, and transformed teacher matching are precise and solid.
2. Extensive experiments confirm the theoretical analysis and show the effectiveness of each proposed module.

Weaknesses

1. It would be better to provide a detailed summary and comparison of the latest related work.
2. Results on transformer models such as ViT would also make the paper more convincing.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- Fruitful discussion about related works to engage the readers.
- Theoretical derivation from KD to the proposed TTM.

Weaknesses

The results depend entirely on the listed T and β values for each experiment (see Tables 8 and 9), which makes the method impractical. Furthermore, the optimal values may vary from task to task, dataset to dataset, and backbone to backbone. These are my main concerns. Given the marginal gains over the baselines, these empirical results actually weaken the claimed contribution.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The idea of rethinking KD via temperature scaling is interesting.
2. The final TTM does not introduce extra hyper-parameters, and the training speed stays the same.
3. The results on various datasets and models demonstrate its effectiveness.

Weaknesses

1. Some references and comparisons are missing:
[1] Knowledge distillation from a stronger teacher.
[2] From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels.
[3] Curriculum Temperature for Knowledge Distillation.
[4] VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale.
2. When temperature=1, is TTM the same as the original KD? In some papers, the temperature on

Code & Models

Repositories

zkxufo/TTM
PyTorch · Official

Taxonomy

Topics · Educational Technology and Assessment

Methods · Knowledge Distillation