TopKD: Top-scaled Knowledge Distillation

Qi Wang; Jinjia Zhou

arXiv:2508.04539 · cs.CV · August 7, 2025

TL;DR

This paper introduces TopKD, a novel logit-based knowledge distillation framework that adaptively amplifies informative logits, significantly improving distillation performance across various datasets and architectures.

Contribution

TopKD presents a simple, efficient, and architecture-agnostic method that enhances logit-based knowledge distillation by focusing on top-K logits with a novel scaling and loss approach.

Findings

01. TopKD outperforms existing methods on multiple datasets.

02. It effectively distills Vision Transformers.

03. The method is architecture-agnostic and easy to integrate.

Abstract

Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher's logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100,…
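
The abstract names the two components, but this excerpt does not give their formulas. As a rough illustration of the general idea only, the sketch below combines conventional temperature-scaled logit distillation (KL divergence between softened teacher and student distributions) with a hypothetical additive boost applied to the teacher's top-K logits. The function names, the `k`, `boost`, and `temperature` parameters, and the additive form of the boost are all assumptions made for illustration; they are not the paper's TSM or TDL.

```python
import numpy as np


def log_softmax(z, axis=-1):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def top_k_mask(logits, k):
    # Boolean mask marking the k largest logits in each row.
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask


def boost_top_k(teacher_logits, k=5, boost=1.0):
    # ASSUMPTION: a fixed additive boost on the teacher's top-k logits.
    # The paper describes an adaptive scaling module; this is a stand-in.
    return np.where(top_k_mask(teacher_logits, k),
                    teacher_logits + boost,
                    teacher_logits)


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    # Conventional logit-based KD: KL(teacher || student) on softened
    # distributions, averaged over the batch and rescaled by T^2.
    log_p = log_softmax(teacher_logits / temperature)
    log_q = log_softmax(student_logits / temperature)
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 100))   # e.g. 100 classes
    student = rng.normal(size=(8, 100))
    plain = kd_loss(student, teacher)
    boosted = kd_loss(student, boost_top_k(teacher, k=5, boost=1.0))
    print(f"plain KD loss: {plain:.4f}, top-k boosted KD loss: {boosted:.4f}")
```

Boosting the teacher's top-K logits before the softmax shifts probability mass toward those classes, so the student's gradient is dominated by the teacher's highest-ranked predictions. Whether the paper's modules behave this way cannot be confirmed from the truncated abstract.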
