ViTKD: Practical Guidelines for ViT feature knowledge distillation
Zhendong Yang, Zhe Li, Ailing Zeng, Zexian Li, Chun Yuan, Yu Li

TL;DR
This paper investigates feature-based knowledge distillation for Vision Transformers (ViT). It derives practical guidelines and proposes a new method, ViTKD, that raises DeiT-Tiny from 74.42% to 76.06% on ImageNet-1k.
Contribution
The paper derives three practical guidelines for ViT feature distillation and, building on them, proposes ViTKD, a feature-based method that improves ViT student performance and complements existing logit-based KD.
Findings
ViTKD improves DeiT-Tiny from 74.42% to 76.06% on ImageNet-1k.
ViTKD boosts DeiT-Small from 80.55% to 81.95%.
Combining ViTKD with logit-based KD yields further performance gains.
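As a rough illustration of how a feature-based term and a logit-based term can be combined into one training objective, the sketch below adds a feature-matching loss and a temperature-softened KL loss to the ordinary task loss. It is a generic formulation and not the ViTKD method itself: the function names, the linear projection `proj` used to align student and teacher widths, the weights `alpha` and `beta`, and the temperature are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def logit_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened class distributions.

    Both inputs have shape (batch, classes). The T^2 factor is the usual
    rescaling that keeps gradient magnitudes comparable across temperatures.
    """
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)


def feature_mse_loss(student_feat, teacher_feat, proj):
    """Mean squared error between teacher tokens and projected student tokens.

    student_feat: (batch, tokens, d_s)
    teacher_feat: (batch, tokens, d_t)
    proj:         (d_s, d_t), a hypothetical linear map aligning the widths
    """
    aligned = student_feat @ proj
    return float(((aligned - teacher_feat) ** 2).mean())


def combined_loss(task_loss, feat_loss, kd_loss, alpha=1.0, beta=1.0):
    """Weighted sum; alpha and beta are illustrative hyperparameters."""
    return task_loss + alpha * feat_loss + beta * kd_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy shapes chosen only for the demo: 2 images, 5 tokens, 10 classes.
    s_feat = rng.normal(size=(2, 5, 4))   # narrow student features
    t_feat = rng.normal(size=(2, 5, 8))   # wider teacher features
    proj = rng.normal(size=(4, 8))
    s_logits = rng.normal(size=(2, 10))
    t_logits = rng.normal(size=(2, 10))

    feat = feature_mse_loss(s_feat, t_feat, proj)
    kd = logit_kd_loss(s_logits, t_logits, temperature=2.0)
    print(combined_loss(task_loss=1.0, feat_loss=feat, kd_loss=kd))
```

In a real training loop the projection would be a learned layer trained jointly with the student, and the task loss would be the cross-entropy against ground-truth labels; both are replaced with fixed values here to keep the example self-contained.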
Abstract
Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structure gap. In this paper, we explore the way of feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices in the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD which brings consistent and considerable improvement to the student. On ImageNet-1k, we boost DeiT-Tiny…
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Dense Connections · Dropout
