TinyViT: Fast Pretraining Distillation for Small Vision Transformers
Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan

TL;DR
TinyViT introduces a fast distillation framework to pretrain small, efficient vision transformers that achieve high accuracy on ImageNet-1k with significantly fewer parameters, enabling deployment on resource-limited devices.
Contribution
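The paper presents a fast pretraining distillation framework for small vision transformers. Teacher logits are sparsified and stored on disk ahead of time, so the small student models can benefit from large-scale pretraining data without running the large teacher during training.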
Findings
Achieves 84.8% top-1 accuracy on ImageNet-1k with 21M parameters.
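Comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters; at higher input resolution it reaches 86.5% top-1, slightly better than Swin-L with only 11% of its parameters.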
Demonstrates strong transferability to various downstream tasks.
Abstract
Vision transformers (ViT) have recently drawn great attention in computer vision due to their remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a…
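The idea of sparsifying teacher outputs and reusing them later can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sparsify`, `densify`, `distill_loss`), the choice to keep the top-k softmax probabilities, and the even spreading of the leftover probability mass over the remaining classes are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify(teacher_logits, k):
    """Keep only the k largest teacher probabilities as (class_index, prob) pairs.

    In a real pipeline these pairs would be written to disk once, ahead of training.
    """
    probs = softmax(teacher_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def densify(sparse, num_classes):
    """Rebuild a full soft label from the stored pairs.

    Assumption: the probability mass that was not stored is spread
    evenly over the classes that were dropped.
    """
    kept = dict(sparse)
    if num_classes <= len(kept):
        raise ValueError("num_classes must exceed the number of stored entries")
    rest = (1.0 - sum(kept.values())) / (num_classes - len(kept))
    return [kept.get(c, rest) for c in range(num_classes)]


def distill_loss(student_logits, soft_label):
    """Cross-entropy between the teacher soft label and the student prediction."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    log_probs = [x - log_z for x in student_logits]
    return -sum(t * lp for t, lp in zip(soft_label, log_probs))


# Hypothetical 5-class example: store 2 entries instead of 5.
stored = sparsify([2.0, 1.0, 0.1, -1.0, 0.5], k=2)
target = densify(stored, num_classes=5)
loss = distill_loss([1.5, 0.8, 0.0, -0.5, 0.2], target)
```

With many classes and a small k, the stored pairs are far smaller than the dense teacher output, and the student never needs the teacher in memory during training.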
Code & Models
- 🤗 timm/tiny_vit_5m_224.dist_in22k · model · 2.8k downloads
- 🤗 timm/tiny_vit_5m_224.dist_in22k_ft_in1k · model · 9.0k downloads · ♡ 2
- 🤗 timm/tiny_vit_5m_224.in1k · model · 310 downloads
- 🤗 timm/tiny_vit_11m_224.dist_in22k · model · 3.5k downloads
- 🤗 timm/tiny_vit_11m_224.dist_in22k_ft_in1k · model · 261 downloads
- 🤗 timm/tiny_vit_11m_224.in1k · model · 49 downloads
- 🤗 timm/tiny_vit_21m_224.dist_in22k · model · 1.3k downloads
- 🤗 timm/tiny_vit_21m_224.dist_in22k_ft_in1k · model · 16k downloads
- 🤗 timm/tiny_vit_21m_224.in1k · model · 456 downloads
- 🤗 timm/tiny_vit_21m_384.dist_in22k_ft_in1k · model · 1.0k downloads · ♡ 3
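The checkpoint names above appear to follow a regular pattern: model size in millions of parameters, input resolution, then a tag after the dot. The sketch below splits a name into those parts. The helper `parse_checkpoint` and the meaning given to each field are assumptions inferred from the names listed here, not a documented timm API.

```python
import re


def parse_checkpoint(name):
    """Split a listed checkpoint name into its apparent parts (hypothetical helper)."""
    repo, _, model = name.rpartition("/")   # "timm", "tiny_vit_21m_384.dist_in22k_ft_in1k"
    arch, _, tag = model.partition(".")     # "tiny_vit_21m_384", "dist_in22k_ft_in1k"
    match = re.fullmatch(r"tiny_vit_(\d+)m_(\d+)", arch)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    return {
        "repo": repo,
        "arch": arch,
        "params_millions": int(match.group(1)),  # assumed: parameter count in millions
        "resolution": int(match.group(2)),       # assumed: input image side in pixels
        "tag": tag,                               # pretraining / fine-tuning tag
    }


info = parse_checkpoint("timm/tiny_vit_21m_384.dist_in22k_ft_in1k")
print(info["params_millions"], info["resolution"], info["tag"])
```

Assuming the `timm` library is installed, the part after `timm/` is the kind of name typically passed to `timm.create_model(name, pretrained=True)`.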
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
