Dynamic Temperature Scheduler for Knowledge Distillation

Sibgat Ul Islam; Jawad Ibn Ahad; Fuad Rahman; Mohammad Ruhul Amin; Nabeel Mohammed; Shafin Rahman

arXiv:2511.13767·cs.LG·November 19, 2025

TL;DR

This paper introduces Dynamic Temperature Scheduler (DTS), a knowledge distillation method that adapts the temperature during training based on the cross-entropy loss gap between teacher and student. The authors report improved performance across vision and NLP tasks.

Contribution

The paper proposes what the authors describe as, to their knowledge, the first temperature scheduling method that adapts to the divergence between teacher and student distributions. It is designed to integrate with existing KD frameworks.

Findings

01. DTS outperforms static-temperature baselines on multiple datasets.

02. Dynamic adjustment of temperature improves student model training.

03. Method is effective across vision and NLP tasks.

Abstract

Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model, with temperature as a key hyperparameter controlling the softness of output probabilities. Traditional methods use a fixed temperature throughout training, which is suboptimal. Moreover, architectural differences between teacher and student often result in mismatched logit magnitudes. We demonstrate that students benefit from softer probabilities early in training but require sharper probabilities in later stages. We introduce Dynamic Temperature Scheduler (DTS), which adjusts temperature dynamically based on the cross-entropy loss gap between teacher and student. To our knowledge, this is the first temperature scheduling method that adapts based on the divergence between teacher and student distributions. Our method integrates seamlessly with existing KD frameworks. We validate DTS…
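
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scheduling formula, so the rule below is an assumption for illustration only: `dynamic_temperature`, its `t_min`, `t_max`, and `scale` parameters, and the `tanh` mapping are all hypothetical. The sketch only shows the stated behavior, where a large cross-entropy gap between student and teacher (early training) yields a high temperature and softer probabilities, and a small gap (late training) yields a temperature near the minimum and sharper probabilities. The `kd_loss` function is the standard temperature-scaled KL term, not necessarily the loss used in the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]


def cross_entropy(logits, label):
    """Cross-entropy of one example against its ground-truth label (T = 1)."""
    return -log_softmax(logits)[label]


def kd_loss(student_logits, teacher_logits, temperature):
    """Standard KD term: T^2 * KL(teacher || student) on softened outputs."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return temperature ** 2 * kl


def dynamic_temperature(student_ce, teacher_ce, t_min=1.0, t_max=8.0, scale=1.0):
    """Hypothetical schedule (assumption, not the paper's formula).

    Maps the non-negative cross-entropy gap to a temperature in
    [t_min, t_max) with a saturating tanh.
    """
    gap = max(student_ce - teacher_ce, 0.0)
    return t_min + (t_max - t_min) * math.tanh(scale * gap)


# Toy logits for a 3-class example whose true label is class 0.
teacher_logits = [6.0, 1.0, 0.5]
early_student_logits = [0.2, 0.1, 0.0]   # untrained student, near-uniform
late_student_logits = [5.0, 1.0, 0.4]    # trained student, close to teacher
label = 0

teacher_ce = cross_entropy(teacher_logits, label)
t_early = dynamic_temperature(cross_entropy(early_student_logits, label), teacher_ce)
t_late = dynamic_temperature(cross_entropy(late_student_logits, label), teacher_ce)

loss_early = kd_loss(early_student_logits, teacher_logits, t_early)
loss_late = kd_loss(late_student_logits, teacher_logits, t_late)
```

In this toy setting `t_early` comes out well above `t_late`, matching the abstract's claim that softer targets suit early training and sharper targets suit later stages. In a real training loop the two cross-entropy values would presumably be averaged over a batch before the temperature is computed, which is also an assumption here.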

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning