Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

Nicholas Cooper; Lijun Chen; Sailesh Dwivedy; Danna Gurari

arXiv:2511.14981 · cs.CV · November 20, 2025

TL;DR

This paper introduces a feature-only knowledge distillation framework that trains the student's backbone without logit-based losses and uses a new knowledge quality metric to select teacher layers, reporting top-1 accuracy gains of up to 15% over standard methods.

Contribution

The authors propose a novel feature-based knowledge distillation method that excludes logit-based losses and uses a new metric to identify effective teacher layers, leading to state-of-the-art results.

Findings

1. Achieves up to 15% top-1 accuracy boost over standard methods.
2. Effective across diverse datasets and model architectures.
3. Introduces a knowledge quality metric for layer selection.

Abstract

Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method…
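
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares a status-quo style objective (a logit-based cross-entropy term plus a feature term) with a feature-only objective. The function names, the use of mean squared error as the feature loss, the `feat_weight` parameter, and the assumption that student and teacher features already share a dimension are all illustrative assumptions; the abstract does not specify any of them, and the knowledge quality metric is not modeled here.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length feature vectors.

    MSE is an assumed stand-in for a feature-based loss.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def conventional_feature_kd_loss(student_feat, teacher_feat,
                                 student_logits, label, feat_weight=1.0):
    """Hypothetical status-quo objective: logit-based term + feature term."""
    logit_term = cross_entropy(student_logits, label)
    feature_term = mse(student_feat, teacher_feat)
    return logit_term + feat_weight * feature_term


def feature_only_kd_loss(student_feat, teacher_feat):
    """Hypothetical feature-only objective: no logit-based term at all."""
    return mse(student_feat, teacher_feat)


# Toy values, chosen only for illustration.
student_feat = [0.5, 1.0, -1.0]
teacher_feat = [1.0, 1.0, 0.0]
student_logits = [2.0, 0.0]
label = 0

combined = conventional_feature_kd_loss(
    student_feat, teacher_feat, student_logits, label)
feature_only = feature_only_kd_loss(student_feat, teacher_feat)
```

In this sketch the two objectives differ exactly by the cross-entropy term, which is the component the abstract says is dropped when training the student's backbone.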

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications