Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn't Matter (Much)

Zony Yu; Yuqiao Wen; Lili Mou

arXiv:2502.04499 · cs.LG · December 11, 2025

TL;DR

This paper investigates layer-selection strategies in knowledge distillation and finds that the choice of strategy has minimal impact on student performance, challenging previous assumptions about their importance.

Contribution

The study shows that layer-selection strategy is not crucial in KD: even seemingly nonsensical strategies such as reverse matching still yield good student performance, which simplifies KD design.

Findings

01. Layer-selection strategy has little effect on KD performance.

02. Reverse matching yields comparable results to forward matching.

03. Angles between teacher layers explain the robustness of different strategies.
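
To make the strategies named in the findings concrete, here is a minimal Python sketch of how a layer-selection strategy could map student layers to teacher layers. The function names, the 1-indexed layer convention, the even-stride spacing, and the plain mean-squared-error loss are all illustrative assumptions, not the paper's implementation; real setups often add a learned projection when hidden sizes differ.

```python
import random


def forward_matching(num_student, num_teacher):
    """Assumed convention: student layer i maps to an evenly spaced
    teacher layer, in increasing order (layers are 1-indexed)."""
    assert 0 < num_student <= num_teacher
    stride = num_teacher // num_student
    return [(i + 1) * stride for i in range(num_student)]


def reverse_matching(num_student, num_teacher):
    """Same teacher layers as forward matching, assigned in reverse order."""
    return list(reversed(forward_matching(num_student, num_teacher)))


def in_order_random_matching(num_student, num_teacher, seed=0):
    """Randomly chosen teacher layers, sorted so the order is preserved."""
    assert 0 < num_student <= num_teacher
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, num_teacher + 1), num_student))


def matching_loss(student_states, teacher_states, mapping):
    """Mean squared error between each student layer and its matched
    teacher layer. Assumes both models share the same hidden size."""
    total, count = 0.0, 0
    for s_vec, t_idx in zip(student_states, mapping):
        t_vec = teacher_states[t_idx - 1]  # mapping is 1-indexed
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Example: a 3-layer student distilled from a 12-layer teacher.
print(forward_matching(3, 12))  # [4, 8, 12]
print(reverse_matching(3, 12))  # [12, 8, 4]
```

Under this sketch, switching strategy changes only the `mapping` list passed to `matching_loss`; the rest of the training loop is unaffected.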

Abstract

Knowledge distillation (KD) is a popular method of transferring knowledge from a large "teacher" model to a small "student" model. Previous work has explored various layer-selection strategies (e.g., forward matching and in-order random matching) for intermediate-layer matching in KD, where a student layer is forced to resemble a certain teacher layer. In this work, we revisit such layer-selection strategies and observe an intriguing phenomenon that layer-selection strategy does not matter (much) in intermediate-layer matching -- even seemingly nonsensical matching strategies such as reverse matching still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student's perspective. Our work sheds light on KD practice, as layer-selection strategies may not be the main focus of KD…
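
The abstract does not give the exact definition of the angle "viewed from the student's perspective". One possible reading, sketched below in Python, is the angle at a student representation between the directions pointing toward two different teacher-layer representations. The function name `angle_from_student` and the use of flat vectors are assumptions made for illustration only.

```python
import math


def angle_from_student(student, teacher_a, teacher_b):
    """Angle in degrees, measured at `student`, between the directions
    toward `teacher_a` and `teacher_b`. This is an assumed reading of
    the abstract's phrase, not the paper's stated formula."""
    u = [a - s for a, s in zip(teacher_a, student)]
    v = [b - s for b, s in zip(teacher_b, student)]
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("student coincides with a teacher representation")
    cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding error
    return math.degrees(math.acos(cos))


# Toy example: two teacher layers at right angles as seen from the student.
print(angle_from_student([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

Under this reading, a small angle would mean that matching the student to either teacher layer pulls it in a similar direction, which would be consistent with the robustness to layer selection that the paper reports.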

Taxonomy

Topics: Semantic Web and Ontologies