ScaleKD: Strong Vision Transformers Could Be Excellent Teachers
Jiawei Fan, Chao Li, Xiaolong Liu, Anbang Yao

TL;DR
ScaleKD is a knowledge distillation method that uses strong pre-trained vision transformers as teachers. It transfers knowledge to students of different architectures and scales, and the authors report state-of-the-art results along with gains that scale with the teacher.
Contribution
The paper presents ScaleKD, a KD approach that aligns differences in feature computing paradigm, model scale, and knowledge density between teacher and student, enabling distillation from large vision transformers to CNN, MLP, and ViT student models.
Findings
Achieves state-of-the-art distillation performance across multiple architectures.
Shows gains that scale with larger teacher models and larger datasets.
Reduces training data requirements by up to 195x.
Abstract
In this paper, we question if well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross-architecture knowledge distillation (KD) research, in the context of using large-scale datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components, namely a cross attention projector, dual-view feature mimicking, and teacher parameter perception, tailored to address the above problems, we present a simple and effective KD method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification…
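The abstract names a cross attention projector and feature mimicking but does not spell out how they work. As a rough illustration only, the sketch below shows one generic way a cross-attention step can map student features onto a teacher's token layout so that a mean-squared-error mimicking loss can be computed. Everything here is an assumption made for the example: the single-head design, the names `cross_attention_project`, `queries`, `w_k`, and `w_v`, the toy shapes, and the random values. It is not the paper's actual projector or loss.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_project(queries, student_tokens, w_k, w_v):
    """Generic single-head cross-attention (hypothetical helper).

    queries:        (n_teacher_tokens, d_teacher), assumed learnable queries
    student_tokens: (n_student_tokens, d_student)
    w_k, w_v:       (d_student, d_teacher), assumed projection weights
    Returns features shaped (n_teacher_tokens, d_teacher).
    """
    keys = matmul(student_tokens, w_k)
    values = matmul(student_tokens, w_v)
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)

def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for real features or weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
student_tokens = rand_matrix(9, 4, rng)  # toy stand-in: 3x3 CNN feature map, 4 channels
teacher_tokens = rand_matrix(6, 8, rng)  # toy stand-in: 6 ViT tokens of width 8
queries = rand_matrix(6, 8, rng)
w_k = rand_matrix(4, 8, rng)
w_v = rand_matrix(4, 8, rng)

projected = cross_attention_project(queries, student_tokens, w_k, w_v)
loss = mse(projected, teacher_tokens)  # feature-mimicking loss to minimize
print(len(projected), len(projected[0]), loss)
```

The point of the sketch is the shape handling: the student's 9 tokens of width 4 come out as 6 tokens of width 8, matching the teacher, so the two feature sets can be compared directly. In a real training loop the queries and projection weights would be learned alongside the student.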
Taxonomy
Topics: Educational Methods and Technology
Methods: Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Layer Normalization · ALIGN · Multi-Head Attention · Residual Connection · Vision Transformer · Knowledge Distillation
