TL;DR
This paper introduces Plackett-Luce Distillation (PLD), a list-wise ranking loss for knowledge distillation that treats teacher logits as "worth" scores defining a ranking over classes and trains the student to reproduce that ranking, with the goal of better model compression.
Contribution
The paper recasts knowledge distillation from a choice-theoretic perspective using the Plackett-Luce model, and introduces a convex, ranking-based loss for transferring teacher knowledge to the student.
Findings
PLD achieves consistent performance improvements across multiple datasets.
It outperforms traditional divergence and correlation-based distillation methods.
The approach is effective for diverse architectures and teacher-student configurations.
Abstract
Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, it has become the de facto approach to augment cross-entropy with a distillation term. Typically, this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In every case, it acts as an additional term to cross-entropy. This term has its own weight, which must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce "Plackett-Luce Distillation (PLD)", a weighted list-wise ranking loss. In PLD, the teacher model transfers knowledge of its full…
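As a rough illustration of the idea, the sketch below computes a generic Plackett-Luce negative log-likelihood: the teacher logits define an ordering over classes, and the student logits are scored on how likely they make that ordering. The function name `plackett_luce_nll` and the `weights` parameter are hypothetical, and the uniform default weighting is an assumption; the abstract is truncated above, so the exact ranking construction and weighting used by PLD are not reproduced here.

```python
import numpy as np

def plackett_luce_nll(student_logits, teacher_logits, weights=None):
    """Generic weighted Plackett-Luce negative log-likelihood (illustrative only).

    The teacher logits are sorted in descending order to obtain a class ranking.
    The student is penalised at each rank position k by
        logsumexp(student logits of classes ranked k or later) - student logit at k.
    `weights` (one value per rank position) is a hypothetical parameter; uniform
    weights are assumed when it is omitted.
    """
    student_logits = np.asarray(student_logits, dtype=float)
    teacher_logits = np.asarray(teacher_logits, dtype=float)

    # Ranking implied by the teacher: highest teacher logit first.
    order = np.argsort(-teacher_logits, kind="stable")
    s = student_logits[order]

    # Suffix log-sum-exp: entry k is log(sum_{j >= k} exp(s[j])), computed stably.
    suffix_lse = np.logaddexp.accumulate(s[::-1])[::-1]

    per_position = suffix_lse - s  # non-negative; the last position is always 0
    if weights is None:
        weights = np.ones_like(s)  # assumption: uniform position weights
    return float(np.sum(np.asarray(weights, dtype=float) * per_position))


# A student that agrees with the teacher's ordering gets a lower loss
# than one that reverses it.
teacher = [3.0, 2.0, 1.0]
loss_agree = plackett_luce_nll([3.0, 2.0, 1.0], teacher)
loss_reversed = plackett_luce_nll([1.0, 2.0, 3.0], teacher)
```

Because each term subtracts a student logit from a log-sum-exp over student logits, adding a constant to every student logit leaves the loss unchanged, and each term is convex in the student logits.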