One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation
Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang,, Chang Xu

TL;DR
This paper introduces OFA-KD, a novel knowledge distillation framework that effectively bridges the gap between heterogeneous neural network architectures by projecting features into an aligned space and using adaptive target enhancement.
Contribution
The paper proposes OFA-KD, a simple yet effective method for cross-architecture knowledge distillation that improves performance across CNNs, Transformers, and MLPs.
Findings
Achieves up to 8.0% accuracy gain on CIFAR-100.
Improves 0.7% accuracy on ImageNet-1K.
Demonstrates effectiveness across diverse architectures.
Abstract
Knowledge distillation~(KD) has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme. However, most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family, particularly the hint-based approaches. By using centered kernel alignment (CKA) to compare the learned features between heterogeneous teacher and student models, we observe significant feature divergence. This divergence illustrates the ineffectiveness of previous hint-based methods in cross-architecture distillation. To tackle the challenge in distilling heterogeneous models, we propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures. Specifically, we project intermediate features into an…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsAdvanced Neural Network Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
MethodsMulti-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings
