Cumulative Spatial Knowledge Distillation for Vision Transformers
Borui Zhao, Renjie Song, Jiajun Liang

TL;DR
This paper introduces CSKD, a novel knowledge distillation method for vision transformers that effectively transfers spatial knowledge from CNNs without intermediate features, enhancing training efficiency and performance.
Contribution
The paper proposes CSKD with a Cumulative Knowledge Fusion module to improve spatial knowledge transfer from CNNs to ViTs, addressing semantic level mismatch and training limitations.
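As a rough illustration of what fusing local and global knowledge over the course of training could look like, the sketch below blends a per-position (local) teacher distribution with an image-level (global) one. The function name `fuse_targets`, the linear schedule, and the direction of the shift (local early, global late) are assumptions made for this example; the summary above does not specify how the CKF module computes its fusion.

```python
def fuse_targets(local_probs, global_probs, step, total_steps):
    """Blend local and global teacher distributions for one spatial position.

    Hypothetical helper: the linear schedule below is an illustrative
    assumption, not the fusion rule defined in the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if len(local_probs) != len(global_probs):
        raise ValueError("distributions must have the same number of classes")
    # alpha grows from 0 to 1, moving weight from local to global responses.
    alpha = min(max(step / total_steps, 0.0), 1.0)
    return [(1.0 - alpha) * l + alpha * g
            for l, g in zip(local_probs, global_probs)]


# Usage: halfway through training the two sources are weighted equally.
mid_target = fuse_targets([1.0, 0.0], [0.0, 1.0], step=5, total_steps=10)
```

Because both inputs are probability distributions and the weights sum to one, the blended target is itself a valid distribution at every step.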
Findings
CSKD outperforms existing methods on ImageNet-1k.
The CKF module improves ViT training by balancing local and global features.
CSKD also improves performance on downstream tasks.
Abstract
Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts performance because the image-friendly local inductive bias of CNNs helps ViTs learn faster and better, but it also leads to two problems: (1) The network designs of CNNs and ViTs are completely different, which leads to different semantic levels of intermediate features and makes spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from a CNN limits network convergence in the later training period, since the ViT's capability of integrating global information is suppressed by the CNN's local-inductive-bias supervision. To this end, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of ViT from the corresponding spatial responses of CNN, without introducing…
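The idea of supervising every patch token with the teacher's response at the same location can be sketched as a per-position distillation loss. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the CNN's spatial responses have already been turned into class logits per location and aligned with the ViT patch grid, and it uses a plain KL divergence as the matching criterion. The names `softmax`, `kl_divergence`, and `spatial_distillation_loss` are hypothetical helpers introduced for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def spatial_distillation_loss(student_patch_logits, teacher_spatial_logits):
    """Mean KL(teacher || student) over aligned spatial positions.

    Assumption: both arguments are lists of per-position class-logit vectors,
    where index i in each list refers to the same image location.
    """
    if len(student_patch_logits) != len(teacher_spatial_logits):
        raise ValueError("student and teacher must cover the same positions")
    if not student_patch_logits:
        raise ValueError("at least one spatial position is required")
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_patch_logits, teacher_spatial_logits)
    ]
    return sum(losses) / len(losses)


# Usage: two spatial positions, two classes each.
teacher = [[2.0, 0.0], [0.0, 2.0]]
student = [[0.0, 2.0], [0.0, 2.0]]
loss = spatial_distillation_loss(student, teacher)
```

The loss operates only on output-level predictions, so in this sketch no intermediate feature maps of the two networks need to be matched.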
Taxonomy
Topics: Advanced Image Fusion Techniques · Infrared Target Detection Methodologies · Visual Attention and Saliency Detection
Methods: Knowledge Distillation
