ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

Jiawei Fan; Chao Li; Xiaolong Liu; Anbang Yao

arXiv:2411.06786 · cs.CV · November 12, 2024

TL;DR

ScaleKD is a knowledge distillation method that uses strong pre-trained vision transformers as teachers. It transfers knowledge to students of different architectures (CNN, MLP, ViT) and scales, reaching state-of-the-art results, with gains that grow as the teacher model and dataset scale up.

Contribution

The paper presents ScaleKD, a KD approach that aligns teacher-student differences in feature computing paradigm, model scale, and knowledge density, enabling effective distillation from large vision transformers to CNN, MLP, and ViT student models.

Findings

1. Achieves state-of-the-art distillation performance across multiple student architectures.
2. Shows scalable benefits: gains increase with larger teacher models and larger datasets.
3. Reduces training data requirements by up to 195x.

Abstract

In this paper, we question if well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross architecture knowledge distillation (KD) research, in the context of using large-scale datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components namely cross attention projector, dual-view feature mimicking and teacher parameter perception tailored to address the above problems, we present a simple and effective KD method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification…
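The abstract names a cross attention projector and feature mimicking but this listing does not describe their internals. As a rough illustration only, the sketch below shows the generic idea those names suggest: a set of queries attends over student features to produce tokens in the teacher's shape, and a mean squared error pulls them toward the teacher's features. It is not the authors' implementation. Every name, shape, and the choice of MSE here is an assumption, the weights are random rather than trained, and the code is kept dependency-free where a real setup would use PyTorch.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_project(student_tokens, queries, w_key, w_value):
    """Produce one output token per query by attending over student tokens.

    Hypothetical shapes: student_tokens (n_student, d_student),
    queries (n_teacher, d_attn), w_key (d_student, d_attn),
    w_value (d_student, d_teacher). Returns (n_teacher, d_teacher).
    """
    keys = matmul(student_tokens, w_key)
    values = matmul(student_tokens, w_value)
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]        # transpose keys
    scores = matmul(queries, keys_t)                  # (n_teacher, n_student)
    weights = [softmax([scale * s for s in row]) for row in scores]
    return matmul(weights, values)


def mimic_loss(projected, teacher_tokens):
    """Mean squared error between projected student and teacher features."""
    diffs = [
        (p - t) ** 2
        for p_row, t_row in zip(projected, teacher_tokens)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


def random_matrix(rows, cols, rng):
    """Random stand-in for features or weights (illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
student_tokens = random_matrix(6, 4, rng)   # 6 student positions, width 4
teacher_tokens = random_matrix(3, 5, rng)   # 3 teacher tokens, width 5
queries = random_matrix(3, 2, rng)          # one query per teacher token
w_key = random_matrix(4, 2, rng)
w_value = random_matrix(4, 5, rng)

projected = cross_attention_project(student_tokens, queries, w_key, w_value)
loss = mimic_loss(projected, teacher_tokens)
print(f"projected shape: {len(projected)}x{len(projected[0])}, loss: {loss:.4f}")
```

In training, the queries and projection weights would be learned jointly with the student so that this loss decreases. The point of the projection step is that the student's feature layout does not have to match the teacher's token count or width.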

Code & Models

Repositories

deep-optimization/scalekd · PyTorch · Official

Videos

ScaleKD: Strong Vision Transformers Could Be Excellent Teachers · SlidesLive

Taxonomy

Topics: Educational Methods and Technology

Methods: Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Layer Normalization · ALIGN · Multi-Head Attention · Residual Connection · Vision Transformer · Knowledge Distillation