Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective

Bowen Zheng; Ran Cheng

arXiv:2512.04625·cs.LG·December 5, 2025

TL;DR

This paper rethinks Decoupled Knowledge Distillation (DKD) from a predictive distribution perspective. It introduces Generalized DKD (GDKD), a more versatile way of decoupling logits, and examines how the teacher's predictive distribution affects the gradients of the loss, leading to better performance than DKD across multiple benchmarks.

Contribution

It proposes the Generalized Decoupled Knowledge Distillation (GDKD) loss together with a partition strategy based on the teacher's predictive distribution, which enhances knowledge transfer and outperforms existing methods.

Findings

01. GDKD outperforms DKD and other distillation methods across multiple benchmarks.

02. Partitioning by the teacher's top logit improves the relationships among non-top logits.

03. Focusing the distillation loss on non-top logits enhances knowledge extraction.

Abstract

In the history of knowledge distillation, the focus once shifted from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. In response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model's predictive distribution and its impact on the gradients of GDKD loss, uncovering two critical insights often overlooked: (1) the partitioning by the top logit considerably…
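To make "decoupling logits" concrete, here is a minimal sketch in plain Python. It is not the paper's GDKD loss, whose exact form is not given on this page. It only illustrates the standard chain rule for KL divergence that such a decoupling can rest on: splitting the classes at the teacher's top-k logits separates the teacher-to-student KL into a group-level (binary) term plus within-group terms. The function names, the parameter `k`, and the weights `w_binary`, `w_top`, and `w_rest` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decoupled_kl(teacher_logits, student_logits, k=1, temperature=1.0):
    """Split KL(teacher || student) at the teacher's top-k classes.

    Returns (binary, within_top, within_rest, top_mass), where
    KL = binary + top_mass * within_top + (1 - top_mass) * within_rest.
    Assumes 1 <= k < number of classes.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)

    # Partition classes by the teacher's predicted probability.
    order = sorted(range(len(p_t)), key=lambda i: p_t[i], reverse=True)
    top, rest = order[:k], order[k:]

    # Probability mass each model assigns to the top group.
    mass_t = sum(p_t[i] for i in top)
    mass_s = sum(p_s[i] for i in top)

    # Group-level term: top group versus everything else.
    binary = kl([mass_t, 1.0 - mass_t], [mass_s, 1.0 - mass_s])

    # Within-group terms: each group renormalized to sum to one.
    within_top = kl([p_t[i] / mass_t for i in top],
                    [p_s[i] / mass_s for i in top])
    within_rest = kl([p_t[i] / (1.0 - mass_t) for i in rest],
                     [p_s[i] / (1.0 - mass_s) for i in rest])
    return binary, within_top, within_rest, mass_t


def weighted_decoupled_loss(teacher_logits, student_logits, k=1,
                            w_binary=1.0, w_top=1.0, w_rest=1.0,
                            temperature=1.0):
    """Hypothetical reweighting: fixed weights replace the mass coefficients."""
    binary, within_top, within_rest, _ = decoupled_kl(
        teacher_logits, student_logits, k, temperature)
    return w_binary * binary + w_top * within_top + w_rest * within_rest
```

With `k=1` the top group contains a single class, so `within_top` vanishes and only the binary term and the non-top term remain. In plain KL the non-top term is scaled by `1 - top_mass`, which is small when the teacher is confident; replacing that coefficient with a free weight such as `w_rest` is one way to put more emphasis on non-top logits.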

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications