Consistently Informative Soft-Label Temperature for Knowledge Distillation
Hoang-Chau Luong, Nghia Van Vo, Kaiqi Zhao, Lingwei Chen

TL;DR
This paper introduces CIST, an adaptive temperature scaling method for knowledge distillation that produces more consistent and informative soft labels by adjusting for sample difficulty and logit scale differences.
Contribution
CIST assigns sample-wise adaptive temperatures to the teacher and the student, addressing two limitations of a fixed shared temperature in knowledge distillation: inconsistent soft-label entropy across samples and rigid logit-scale alignment between teacher and student.
Findings
CIST produces more informative soft labels across samples.
Experiments show improved accuracy over standard KD methods.
The method incurs negligible additional computational cost.
Abstract
Knowledge distillation (KD) transfers knowledge from a high-capacity teacher to a compact student by matching their predictive distributions, with temperature scaling serving as a central mechanism for smoothing teacher predictions and exposing informative "dark knowledge" beyond the hard label. However, the standard fixed-temperature design is inherently sample-agnostic. Since samples differ in logit scale and learning difficulty, a single global temperature produces teacher soft labels with highly inconsistent entropy: some predictions remain overly sharp and provide limited inter-class information, whereas others become over-smoothed and lose class-discriminative information. Moreover, sharing the same temperature between teacher and student further imposes rigid logit-scale alignment despite their capacity mismatch. To address these limitations, we propose CIST (Consistently…
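The entropy inconsistency described in the abstract can be illustrated with a small sketch. This is not the CIST method itself, only a demonstration of the fixed-temperature problem it targets. The logit values, the temperature `T = 4.0`, and the helper names `softmax` and `entropy` are all assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical teacher logits for two samples with the same class ranking
# but very different logit scales (the second is the first divided by 10).
large_scale = [12.0, 2.0, 1.0, 0.0]
small_scale = [1.2, 0.2, 0.1, 0.0]

T = 4.0  # one global temperature shared by every sample (assumed value)

h_large = entropy(softmax(large_scale, T))
h_small = entropy(softmax(small_scale, T))

print(f"entropy at T={T}, large-scale logits: {h_large:.3f} nats")
print(f"entropy at T={T}, small-scale logits: {h_small:.3f} nats")
print(f"maximum possible entropy (uniform):   {math.log(4):.3f} nats")
```

Under these assumed values, the same temperature leaves the large-scale sample comparatively sharp while pushing the small-scale sample close to the uniform distribution, which is the kind of sample-to-sample inconsistency that motivates a per-sample temperature.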
