TinyViT: Fast Pretraining Distillation for Small Vision Transformers

Kan Wu; Jinnian Zhang; Houwen Peng; Mengchen Liu; Bin Xiao; Jianlong Fu; Lu Yuan

arXiv:2207.10666·cs.CV·July 22, 2022·22 cites

TL;DR

TinyViT introduces a fast distillation framework to pretrain small, efficient vision transformers that achieve high accuracy on ImageNet-1k with significantly fewer parameters, enabling deployment on resource-limited devices.

Contribution

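The paper presents a fast pretraining distillation method for small vision transformers. Teacher logits are sparsified and stored on disk in advance, which saves memory and computation during training while letting small models benefit from large-scale pretraining data.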

Findings

01. Achieves 84.8% top-1 accuracy on ImageNet-1k with 21M parameters.

02. Outperforms larger models like Swin-B and Swin-L in parameter efficiency.

03. Demonstrates strong transferability to various downstream tasks.

Abstract

Vision transformers (ViTs) have recently drawn great attention in computer vision due to their remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a…
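The sparsify-and-store step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the top-k rule, the uniform spreading of the discarded probability mass, the JSON file format, and all function names (`sparsify_top_k`, `densify`, `soft_cross_entropy`, `save_sparse`, `load_sparse`) are assumptions made for illustration only.

```python
import json
import math
import os
import tempfile


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify_top_k(probs, k):
    """Keep only the k largest teacher probabilities as (index, value) pairs."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


def densify(sparse, num_classes):
    """Rebuild a dense soft label from the stored pairs.

    Assumption: the probability mass that was dropped is spread
    uniformly over the classes that were not stored.
    """
    dense = [0.0] * num_classes
    kept = {i for i, _ in sparse}
    kept_mass = 0.0
    for i, p in sparse:
        dense[i] = p
        kept_mass += p
    rest = num_classes - len(kept)
    if rest > 0:
        fill = (1.0 - kept_mass) / rest
        for i in range(num_classes):
            if i not in kept:
                dense[i] = fill
    return dense


def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy between the student's prediction and the teacher's soft label."""
    m = max(student_logits)
    lse = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(t * (x - lse) for x, t in zip(student_logits, teacher_probs))


def save_sparse(path, sparse):
    """Write the sparse teacher output to disk (hypothetical JSON format)."""
    with open(path, "w") as f:
        json.dump(sparse, f)


def load_sparse(path):
    """Read the sparse teacher output back as (index, value) pairs."""
    with open(path) as f:
        return [(int(i), float(p)) for i, p in json.load(f)]


if __name__ == "__main__":
    # Offline: run the teacher once, keep the top-k entries, store them.
    teacher_logits = [2.0, 1.0, 0.1, -1.0, -2.0]
    sparse = sparsify_top_k(softmax(teacher_logits), k=2)
    path = os.path.join(tempfile.mkdtemp(), "teacher_topk.json")
    save_sparse(path, sparse)

    # Training time: no teacher forward pass, only a disk read.
    target = densify(load_sparse(path), num_classes=5)
    student_logits = [1.5, 0.5, 0.0, 0.0, 0.0]
    print(soft_cross_entropy(student_logits, target))
```

Storing only k (index, value) pairs per sample instead of one value per class is what keeps the on-disk footprint small, and reading them back replaces the teacher's forward pass during student training.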

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques