ViTKD: Practical Guidelines for ViT feature knowledge distillation

Zhendong Yang; Zhe Li; Ailing Zeng; Zexian Li; Chun Yuan; Yu Li

arXiv:2209.02432 · cs.CV · September 7, 2022 · 23 cites

TL;DR

This paper investigates feature-based knowledge distillation for Vision Transformers (ViT), derives practical guidelines for it, and proposes a new method, ViTKD, that raises DeiT-Tiny and DeiT-Small accuracy on ImageNet-1k by 1.64 and 1.40 points, respectively.

Contribution

The paper derives three practical guidelines for ViT feature distillation from controlled experiments and proposes ViTKD, a feature-based method built on those guidelines that improves ViT student performance and complements existing logit-based KD.

Findings

01

ViTKD improves DeiT-Tiny from 74.42% to 76.06% on ImageNet-1k.

02

ViTKD boosts DeiT-Small from 80.55% to 81.95%.

03

Combining ViTKD with logit-based KD yields further performance gains.
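The third finding describes adding a feature-level loss on top of logit-based KD. The sketch below shows one generic way such a combination can be written. It is not the ViTKD loss itself, which this page does not specify: the function names, the plain mean-squared-error feature term, and the weights `alpha` and `beta` are all illustrative assumptions. It also assumes the student tokens have already been projected to the teacher's embedding width.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return kl * temperature ** 2


def feature_mse_loss(student_tokens, teacher_tokens):
    """Mean squared error between two token-feature matrices.

    Assumption: both inputs are lists of rows with identical shapes,
    i.e. the student features were already projected to teacher width.
    """
    total, count = 0.0, 0
    for s_row, t_row in zip(student_tokens, teacher_tokens):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    return total / count


def combined_loss(task_loss, feat_loss, kd_loss, alpha=1.0, beta=1.0):
    """Hypothetical weighting: task loss + alpha * feature + beta * logit KD."""
    return task_loss + alpha * feat_loss + beta * kd_loss


if __name__ == "__main__":
    feat = feature_mse_loss([[1.0, 2.0]], [[1.0, 4.0]])
    kd = logit_kd_loss([0.2, 1.5, -0.3], [0.1, 2.0, -0.5], temperature=2.0)
    print(combined_loss(task_loss=0.9, feat_loss=feat, kd_loss=kd))
```

Keeping the two distillation terms as separate functions makes it easy to switch either one off, which mirrors the comparison in the findings between feature distillation alone and feature distillation combined with logit-based KD.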

Abstract

Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structure gap. In this paper, we explore the way of feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices in the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD which brings consistent and considerable improvement to the student. On ImageNet-1k, we boost DeiT-Tiny…

Code & Models

Repositories

yzd-v/cls_KD · PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Dense Connections · Dropout