Cross-Architecture Knowledge Distillation
Yufan Liu, Jiajiong Cao, Bing Li, Weiming Hu, Jingting Ding, Liang Li

TL;DR
This paper introduces a novel cross-architecture knowledge distillation method that effectively transfers knowledge from Transformer models to CNNs, improving performance and robustness across various datasets.
Contribution
It proposes a new distillation approach using cross attention and group-wise linear projectors for cross-architecture transfer, along with a multi-view robust training scheme.
Findings
Outperforms 14 state-of-the-art methods on multiple datasets.
Effective in transferring knowledge from Transformer to CNN.
Enhances robustness and stability of the distillation process.
Abstract
Transformer attracts much attention because of its ability to learn global relations and superior performance. In order to achieve higher performance, it is natural to distill complementary knowledge from Transformer to convolutional neural network (CNN). However, most existing knowledge distillation methods only consider homologous-architecture distillation, such as distilling knowledge from CNN to CNN. They may not be suitable when applying to cross-architecture scenarios, such as from Transformer to CNN. To deal with this problem, a novel cross-architecture knowledge distillation method is proposed. Specifically, instead of directly mimicking output/intermediate features of the teacher, partially cross attention projector and group-wise linear projector are introduced to align the student features with the teacher's in two projected feature spaces. And a multi-view robust training…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Human Pose and Action Recognition · Brain Tumor Detection and Classification
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam
