Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation
Xu Zheng, Yunhao Luo, Pengyuan Zhou, Lin Wang

TL;DR
This paper introduces C2VKD, a knowledge distillation (KD) framework that transfers knowledge from a large pre-trained CNN teacher to a compact Vision Transformer (ViT) student for semantic segmentation, and reports larger mIoU gains than existing KD methods on three benchmark datasets.
Contribution
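The paper proposes a CNN-to-ViT knowledge distillation framework built from two modules: visual-linguistic feature distillation (VLFD), which addresses the heterogeneity between CNN teacher features and ViT student features, and pixel-wise decoupled distillation (PDD), which addresses the capacity gap between teacher and student.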
Findings
mIoU increment more than 200% of that achieved by state-of-the-art KD methods
Effective transfer of knowledge from CNNs to ViT models in segmentation tasks
Consistent performance gains across three benchmark datasets
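The findings above are stated in terms of mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the flat-list label format are assumptions made for illustration; benchmark evaluations typically accumulate intersections and unions over the whole dataset rather than over one sample, and this is not the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two flat lists of integer class labels.

    Classes that appear in neither list are skipped so they do not
    distort the average (an assumption of this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1.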
Abstract
In this paper, we tackle a new problem: how to transfer knowledge from the pre-trained cumbersome yet well-performed CNN-based model to learn a compact Vision Transformer (ViT)-based model while maintaining its learning capacity? Due to the completely different characteristics of ViT and CNN and the long-existing capacity gap between teacher and student models in Knowledge Distillation (KD), directly transferring the cross-model knowledge is non-trivial. To this end, we subtly leverage the visual and linguistic-compatible feature character of ViT (i.e., student), and its capacity gap with the CNN (i.e., teacher) and propose a novel CNN-to-ViT KD framework, dubbed C2VKD. Importantly, as the teacher's features are heterogeneous to those of the student, we first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and…
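The teacher-to-student transfer described in the abstract builds on the general idea of knowledge distillation. The sketch below shows only the classic temperature-softened, pixel-wise KL-divergence loss between teacher and student class logits. It is a generic baseline written under assumptions, not the paper's VLFD or PDD module. The names `softmax` and `pixelwise_kd_loss`, the default temperature of 2.0, and the list-of-lists logit format are all hypothetical choices for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over pixels, scaled by temperature**2.

    Each argument is a list with one entry per pixel; each entry is a
    list of class logits. Generic KD baseline, not the C2VKD losses.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same pixels")
    if not teacher_logits:
        raise ValueError("at least one pixel is required")

    total = 0.0
    for t_pix, s_pix in zip(teacher_logits, student_logits):
        p = softmax(t_pix, temperature)  # softened teacher distribution
        q = softmax(s_pix, temperature)  # softened student distribution
        # Clamp q to avoid division by zero if a probability underflows.
        total += sum(
            pi * math.log(pi / max(qi, 1e-12))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )

    return (temperature ** 2) * total / len(teacher_logits)
```

The loss is zero when the student reproduces the teacher's softened distribution at every pixel and grows as the two diverge. In practice such a term is usually combined with a supervised loss on the ground-truth labels.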
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Linear Layer · Label Smoothing · Residual Connection · Adam · Absolute Position Encodings · Layer Normalization · Dense Connections
