Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation

Xu Zheng; Yunhao Luo; Pengyuan Zhou; Lin Wang

arXiv:2310.07265 · cs.CV · October 12, 2023

Open Access

TL;DR

This paper introduces C2VKD, a novel knowledge distillation framework that effectively transfers knowledge from CNNs to Vision Transformers for semantic segmentation, significantly improving performance over existing methods.

Contribution

The paper proposes a new CNN-to-ViT knowledge distillation framework built from two modules, visual-linguistic feature distillation (VLFD) and pixel-wise decoupled distillation (PDD), to bridge the capacity gap and feature heterogeneity between CNN teachers and ViT students.

Findings

01 mIoU gain more than 200% of that achieved by state-of-the-art KD methods

02 Effective transfer of knowledge from CNN teachers to ViT students in segmentation tasks

03 Consistent performance gains across three benchmark datasets
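
For readers unfamiliar with the metric cited in the findings, the following is a minimal sketch of how mean intersection-over-union (mIoU) is commonly computed from a confusion matrix. It is a generic illustration, not the paper's evaluation code; the function name `mean_iou` and the nested-list confusion-matrix layout are assumptions made for this example.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j (layout assumed for this sketch).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                               # correctly labelled pixels
        fn = sum(confusion[c]) - tp                        # class c pixels that were missed
        fp = sum(confusion[r][c] for r in range(n)) - tp   # pixels wrongly assigned to c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)
```

Each class contributes its own intersection-over-union, and the classes are averaged with equal weight, so rare classes count as much as frequent ones.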

Abstract

In this paper, we tackle a new problem: how to transfer knowledge from a pre-trained, cumbersome yet well-performing CNN-based model to a compact Vision Transformer (ViT)-based model while maintaining its learning capacity? Because ViTs and CNNs have completely different characteristics, and because of the longstanding capacity gap between teacher and student models in Knowledge Distillation (KD), directly transferring the cross-model knowledge is non-trivial. To this end, we leverage the visual- and linguistic-compatible feature characteristics of the ViT (i.e., the student) and its capacity gap with the CNN (i.e., the teacher), and propose a novel CNN-to-ViT KD framework, dubbed C2VKD. Importantly, as the teacher's features are heterogeneous to those of the student, we first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and…
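
As background for the teacher-student setup described in the abstract, the following is a minimal sketch of a conventional pixel-wise logit distillation loss, the kind of baseline that KD methods for segmentation typically build on. It is not the VLFD or PDD module, whose details are not given in the text above. The function names, the temperature value, and the list-of-lists input layout are all assumptions made for this illustration, and plain Python is used only to keep the sketch framework-free.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]

def pixelwise_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean over pixels of T^2 * KL(teacher || student).

    Both arguments are lists of per-pixel logit vectors (one list of
    class scores per pixel), assumed to be spatially aligned already.
    """
    total = 0.0
    for t_pix, s_pix in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_pix, temperature)  # softened teacher targets
        log_q = log_softmax(s_pix, temperature)  # softened student predictions
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (temperature ** 2) * total / len(teacher_logits)
```

The loss is zero when the student reproduces the teacher's per-pixel class distribution and grows as the two diverge. A loss of this kind assumes the teacher and student outputs are directly comparable, which is the assumption the abstract says breaks down for feature-level transfer between a CNN and a ViT.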

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Linear Layer · Label Smoothing · Residual Connection · Adam · Absolute Position Encodings · Layer Normalization · Dense Connections