Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation
Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari

TL;DR
This paper introduces a feature-only knowledge distillation framework that surpasses traditional logit-based methods by leveraging latent representations and a new knowledge quality metric, reporting top-1 accuracy gains of up to 15% over standard methods.
Contribution
The authors propose a novel feature-based knowledge distillation method that excludes logit-based losses and uses a new metric to identify effective teacher layers, leading to state-of-the-art results.
Findings
Achieves up to a 15% boost in top-1 accuracy over standard methods.
Effective across diverse datasets and model architectures.
Introduces a knowledge quality metric for layer selection.
Abstract
Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method…
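The feature-only idea in the abstract can be illustrated with a minimal sketch: the student's training signal comes only from matching teacher features at selected layers, with no cross-entropy or other logit-based term. Everything named here is hypothetical. The function names, the choice of mean squared error, and the per-layer weights are assumptions made for illustration. The abstract does not specify the paper's actual loss or its knowledge quality metric, so neither is reproduced here.

```python
from typing import Optional, Sequence

# One layer's features for a batch: a list of equal-length feature vectors.
Batch = Sequence[Sequence[float]]


def feature_mse(student: Batch, teacher: Batch) -> float:
    """Mean squared error between student and teacher features of one layer.

    MSE is an assumed stand-in; the paper's loss is not given in the abstract.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher batch sizes differ")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions differ; project the student first")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty feature batch")
    return total / count


def feature_only_loss(
    student_layers: Sequence[Batch],
    teacher_layers: Sequence[Batch],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-layer feature losses, with no logit-based term.

    `weights` is a hypothetical hook for emphasizing some teacher layers;
    it is not the paper's knowledge quality metric.
    """
    if len(student_layers) != len(teacher_layers):
        raise ValueError("student and teacher must supply the same number of layers")
    if weights is None:
        weights = [1.0] * len(student_layers)
    if len(weights) != len(student_layers):
        raise ValueError("one weight is required per layer")
    return sum(
        w * feature_mse(s, t)
        for w, s, t in zip(weights, student_layers, teacher_layers)
    )
```

In a real pipeline the student's features would typically be projected to the teacher's dimensionality before comparison, and the loss would be computed on framework tensors rather than Python lists; the sketch only shows that the objective contains feature terms and nothing else.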
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
