Knowledge Distillation Based on Transformed Teacher Matching
Kaixiang Zheng, En-Hui Yang

TL;DR
This paper introduces transformed teacher matching (TTM), a knowledge distillation variant that omits student-side temperature scaling and thereby gains an inherent Rényi entropy regularizer that improves generalization. A weighted extension, WTTM, is reported to reach state-of-the-art results.
Contribution
The paper proposes TTM, a new KD variant that removes student-side temperature scaling and, as a result, carries an inherent Rényi entropy regularization term, along with WTTM, an adaptive weighting scheme for better performance.
Findings
TTM outperforms traditional KD in generalization.
WTTM achieves state-of-the-art accuracy.
Inherent regularization improves student model performance.
Abstract
As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both teacher's logits and student's logits in KD. Motivated by some recent works, in this paper, we drop instead temperature scaling on the student side, and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of probability distribution, we show that in comparison with the original KD, TTM has an inherent Rényi entropy term in its objective function, which serves as an extra regularization term. Extensive experiment results demonstrate that thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance student's…
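The relationship the abstract describes can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names and example logits are hypothetical, and loss weighting (for example any T² factor, or the balance with the hard-label cross-entropy term) is left out. It assumes the standard KD distillation term KL(p_T || q_T), where p_T and q_T are the temperature-scaled teacher and student outputs, and takes the TTM term to be KL(p_T || q) with the untempered student output q, as the abstract states. With γ = 1/T, the power-transform view gives the cross-entropy identity H(p_T, q_T) = γ·H(p_T, q) + (1 − γ)·H_γ(q), where H_γ is the Rényi entropy of order γ.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def power_transform(probs, gamma):
    """Raise each probability to the power gamma, then renormalize."""
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [x / total for x in powered]


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def renyi_entropy(q, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1)."""
    return math.log(sum(qi ** alpha for qi in q)) / (1.0 - alpha)


def kd_distill_term(teacher_logits, student_logits, T):
    """Conventional KD: temperature applied to teacher AND student."""
    return kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))


def ttm_distill_term(teacher_logits, student_logits, T):
    """TTM-style term (assumed form): temperature on the teacher only."""
    return kl_divergence(softmax(teacher_logits, T), softmax(student_logits))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # hypothetical teacher logits
    student = [1.5, 0.3, -0.2]  # hypothetical student logits
    T = 4.0
    gamma = 1.0 / T

    p_T = softmax(teacher, T)
    q = softmax(student)
    q_T = softmax(student, T)

    # Temperature scaling of logits equals a power transform of probabilities.
    print(q_T, power_transform(q, gamma))

    # Both sides of the cross-entropy identity stated in the lead-in.
    lhs = cross_entropy(p_T, q_T)
    rhs = gamma * cross_entropy(p_T, q) + (1.0 - gamma) * renyi_entropy(q, gamma)
    print(lhs, rhs)

    print(kd_distill_term(teacher, student, T), ttm_distill_term(teacher, student, T))
```

Rearranged, the identity reads H(p_T, q) = T·H(p_T, q_T) − (T − 1)·H_γ(q), so for T > 1 minimizing the TTM-style term also rewards a higher Rényi entropy of the student output, which is consistent with the regularization the abstract describes. Under these definitions the two distillation terms coincide at T = 1.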
Peer Reviews
Decision: ICLR 2024 poster
I think overall the paper provides new findings to understand the role of temperature in knowledge distillation, and the evaluation experiments are extensive. 1. The theoretical derivation and analysis for the general KD, Rényi entropy, and transformed teacher matching is precise and solid. 2. Extensive experiments confirm the theoretical analysis and show the effectiveness of each proposed module.
1. It would be better to provide a detailed summary and comparison of the latest related works. 2. It would also be more convincing to show results on transformer models such as ViT.
- Fruitful discussion about related works to engage the readers. - Theoretical derivation from KD to the proposed TTM.
The results depend entirely on the listed T and β values for each experiment (see Tables 8 and 9), which makes the method impractical. Furthermore, the optimal values may vary from task to task, dataset to dataset, and backbone to backbone. These are my main concerns. Given the marginal gain over the baselines, these empirical results actually weaken the claimed contribution.
1. Rethinking KD via temperature scaling is an interesting approach. 2. The final TTM does not introduce extra hyper-parameters, and the training speed remains the same. 3. The results on various datasets and models demonstrate its effectiveness.
1. Some references and comparisons are missing: [1] Knowledge distillation from a stronger teacher. [2] From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels. [3] Curriculum Temperature for Knowledge Distillation. [4] VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale. 2. When temperature=1, is TTM the same as the original KD? In some papers, the temperature on
Taxonomy
Topics: Educational Technology and Assessment
Methods: Knowledge Distillation
