Super Vision Transformer

Mingbao Lin; Mengzhao Chen; Yuxin Zhang; Chunhua Shen; Rongrong Ji; Liujuan Cao

arXiv:2205.11397·cs.CV·July 20, 2023

TL;DR

SuperViT is a training paradigm for vision transformers in which a single trained model handles multiple input patch sizes and token keeping rates, reducing computational cost while improving accuracy.

Contribution

We introduce SuperViT, a novel training method for vision transformers that allows a single model to adapt to different computational budgets and input configurations, outperforming existing efficient transformer models.

Findings

01

Reduces the FLOPs of DeiT-S by 2x while increasing accuracy

02

Outperforms SOTA models at the same FLOPs level

03

Provides a versatile transformer model adaptable to hardware constraints

Abstract

We attempt to reduce the computational costs of vision transformers (ViTs), which grow quadratically with the number of tokens. We present a novel training paradigm that trains only one ViT model at a time, yet provides improved image recognition performance at various computational costs. The trained ViT model, termed super vision transformer (SuperViT), can handle incoming patches of multiple sizes and preserve informative tokens at multiple keeping rates (the ratio of tokens kept), achieving good hardware efficiency at inference, given that the available hardware resources often change from time to time. Experimental results on ImageNet demonstrate that SuperViT can considerably reduce the computational costs of ViT models while even increasing performance. For example, we reduce 2x FLOPs of DeiT-S while…
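The quadratic dependence on token count mentioned in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the SuperViT code; the helper names `num_tokens` and `attention_cost` are hypothetical, the image size, patch sizes, and keeping rate are example values chosen for illustration, and the cost model counts only the two token-by-token matrix products of one self-attention layer (it ignores the class token, the projections, and the MLP, which scale linearly with the number of tokens). It also applies a single keeping rate to the whole token set, which is a simplification.

```python
def num_tokens(image_size: int, patch_size: int, keep_rate: float = 1.0) -> int:
    """Patch tokens left after splitting a square image and keeping a fraction of them."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    total = (image_size // patch_size) ** 2
    return max(1, int(total * keep_rate))


def attention_cost(tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for the QK^T and attention-times-V products."""
    return 2 * tokens * tokens * dim


if __name__ == "__main__":
    dim = 384  # assumed embedding width, for illustration only
    for patch_size, keep_rate in [(16, 1.0), (16, 0.5), (32, 1.0)]:
        n = num_tokens(224, patch_size, keep_rate)
        print(f"patch={patch_size} keep={keep_rate} tokens={n} "
              f"attention MACs={attention_cost(n, dim)}")
```

Under this simplified model, halving the number of tokens (whether by keeping half of them or by using coarser patches) cuts the token-by-token attention cost to a quarter, which is why both patch size and keeping rate are useful knobs for matching a compute budget.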

Code & Models

Repositories

lmbxmu/supervit (PyTorch, official)

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer