TopKD: Top-scaled Knowledge Distillation
Qi Wang, Jinjia Zhou

TL;DR
This paper introduces TopKD, a logit-based knowledge distillation framework that adaptively amplifies the most informative (Top-K) teacher logits. The authors report improved distillation performance across multiple datasets and architectures.
Contribution
TopKD is a simple, efficient, and architecture-agnostic method that strengthens logit-based knowledge distillation by focusing on the Top-K logits, using a Top-K Scaling Module (TSM) and a Top-K Decoupled Loss (TDL).
Findings
The authors report that TopKD outperforms existing distillation methods on multiple datasets, including CIFAR-100.
TopKD is also effective when distilling Vision Transformers.
The method is architecture-agnostic and easy to integrate.
Abstract
Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher's logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100,…
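The abstract names the two components but does not give their formulas, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes a fixed amplification factor `alpha` (the paper describes the scaling as adaptive), a stretch of the Top-K teacher logits away from the smallest logit, and a "decoupled" loss made of two separately weighted KL terms, one over the teacher's Top-K classes and one over the remaining classes. All function names, parameters, and default values here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a distillation temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(logits, k):
    """Indices of the k largest logits, largest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def scale_top_k(logits, k, alpha):
    """Hypothetical stand-in for a Top-K scaling step.

    Stretches each Top-K logit away from the smallest logit by a fixed
    factor alpha; all other logits are left unchanged.
    """
    base = min(logits)
    top = set(top_k_indices(logits, k))
    return [base + alpha * (x - base) if i in top else x
            for i, x in enumerate(logits)]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def renormalize(probs, idx):
    """Restrict a distribution to the given indices and renormalize."""
    sub = [probs[i] for i in idx]
    total = sum(sub)
    return [s / total for s in sub]


def top_k_decoupled_kd(teacher_logits, student_logits, k=2, alpha=1.5,
                       temperature=4.0, w_top=1.0, w_rest=0.5):
    """Assumed decoupled loss: weighted KL on Top-K and non-Top-K classes."""
    scaled = scale_top_k(teacher_logits, k, alpha)
    p_t = softmax(scaled, temperature)
    p_s = softmax(student_logits, temperature)
    top = top_k_indices(teacher_logits, k)
    rest = [i for i in range(len(teacher_logits)) if i not in top]
    loss_top = kl(renormalize(p_t, top), renormalize(p_s, top))
    loss_rest = kl(renormalize(p_t, rest), renormalize(p_s, rest)) if rest else 0.0
    return w_top * loss_top + w_rest * loss_rest


teacher = [1.0, 3.0, 2.0, 0.0]
student = [0.0, 0.0, 0.0, 0.0]
loss = top_k_decoupled_kd(teacher, student)
```

In this sketch the loss is zero when the student's logits equal the scaled teacher logits and grows as the student's ranking within the Top-K classes drifts from the teacher's. Because it only reads logits, a loss of this shape needs no extra modules or architectural changes, which matches the integration claim in the abstract.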
