Heterogeneous Generative Knowledge Distillation with Masked Image Modeling
Ziming Wang, Shumin Han, Xiaodi Wang, Jing Hao, Xianbin Cao, Baochang Zhang

TL;DR
This paper introduces Heterogeneous Generative Knowledge Distillation (H-GKD), a novel method that transfers knowledge from large Transformer models to small CNNs using masked image modeling, improving performance across multiple visual tasks.
Contribution
It is the first to apply masked image modeling for knowledge distillation between heterogeneous models, bridging Transformer and CNN architectures effectively.
Findings
Achieves state-of-the-art results in image classification, object detection, and segmentation.
Improves ResNet50 accuracy on ImageNet-1K from 76.98% to 80.01% (+3.03 percentage points).
Demonstrates robustness across various model sizes and tasks.
Abstract
Small CNN-based models usually require transferring knowledge from a large model before they are deployed on computationally resource-limited edge devices. Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. This is mainly due to the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a…
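The abstract is cut off above, so the paper's actual masking scheme, loss, and architecture are not available here. As a rough illustration only of the general idea it describes (hide some image patches, let a student see the visible ones, and train it to mimic a teacher's features), a minimal sketch might look like the following. All function names, the 0.75 mask ratio, and the mean-squared-error objective are assumptions made for illustration, not details taken from the paper.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a masked (hidden) patch.

    The mask ratio is a free parameter here (an assumption, not from the paper).
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def apply_mask(patch_features, mask):
    """Zero out masked patches.

    This is a dense stand-in for what sparse convolution achieves by
    skipping masked positions; it is not a sparse-convolution implementation.
    """
    return [[0.0] * len(f) if m else list(f)
            for f, m in zip(patch_features, mask)]


def mimic_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher per-patch features.

    MSE is an assumed choice of feature-mimicking objective.
    """
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count


# Toy usage: 8 patches with 2-dimensional features each.
patches = [[float(i), float(i) + 0.5] for i in range(8)]
mask = random_patch_mask(len(patches), mask_ratio=0.75)
visible_only = apply_mask(patches, mask)
# A real student network would process `visible_only`; here the masked
# input itself stands in for the student's output features.
loss = mimic_loss(visible_only, patches)
```

In a real setup the student would be a trained network and the teacher features would come from a frozen Transformer; the sketch only shows where the mask and the mimicking loss enter.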
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Softmax · Dense Connections · Absolute Position Encodings · Mutual Information Machine/Mask Image Modeling · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Knowledge Distillation
