Super Vision Transformer
Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, Liujuan Cao

TL;DR
SuperViT is a training paradigm that lets a single vision transformer handle multiple input patch sizes and token keeping rates, reducing computational cost while improving accuracy.
Contribution
We introduce SuperViT, a novel training method for vision transformers that allows a single model to adapt to different computational budgets and input configurations, outperforming existing efficient transformer models.
Findings
Reduces the FLOPs of DeiT-S by 2x while increasing accuracy
Outperforms SOTA models at the same FLOPs level
Provides a versatile transformer model adaptable to hardware constraints
Abstract
We aim to reduce the computational cost of vision transformers (ViTs), which grows quadratically with the number of tokens. We present a novel training paradigm that trains only one ViT model at a time, yet provides improved image recognition performance at various computational costs. The trained ViT model, termed super vision transformer (SuperViT), can process incoming patches of multiple sizes and preserve informative tokens at multiple keeping rates (the ratio of kept tokens), achieving good hardware efficiency at inference, given that available hardware resources often change over time. Experimental results on ImageNet demonstrate that SuperViT can considerably reduce the computational cost of ViT models while even increasing performance. For example, we reduce the FLOPs of DeiT-S by 2x while…
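The quadratic dependence on token count mentioned in the abstract can be made concrete with a rough cost model. The sketch below is not the paper's implementation: the function names are hypothetical, the multiply-accumulate formula is the usual back-of-the-envelope estimate for a standard ViT encoder block, and applying one keeping rate uniformly to every layer is a simplifying assumption made only for illustration. The width of 384 and depth of 12 are DeiT-S-like values.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def kept_tokens(num_tokens: int, keep_rate: float) -> int:
    """Tokens remaining after keeping a fraction `keep_rate` of them."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return max(1, round(num_tokens * keep_rate))


def encoder_macs(num_tokens: int, dim: int, depth: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates of a plain ViT encoder.

    Assumption: standard block (multi-head attention + two-layer MLP);
    layer norms, softmax, biases, and the class token are ignored.
    """
    n, d = num_tokens, dim
    proj = 4 * n * d * d          # Q, K, V, and output projections (linear in n)
    attn = 2 * n * n * d          # QK^T and attention-weighted sum (quadratic in n)
    mlp = 2 * mlp_ratio * n * d * d  # two linear layers of the MLP (linear in n)
    return depth * (proj + attn + mlp)


if __name__ == "__main__":
    dim, depth = 384, 12  # DeiT-S-like width and depth, used for illustration
    for patch_size in (16, 32):
        full = num_patch_tokens(224, patch_size)
        for keep_rate in (1.0, 0.7, 0.5):
            n = kept_tokens(full, keep_rate)
            gmacs = encoder_macs(n, dim, depth) / 1e9
            print(f"patch={patch_size} keep={keep_rate} tokens={n} GMACs={gmacs:.2f}")
```

Because the attention term scales with the square of the token count, halving the tokens cuts the estimated cost by more than half, which is why both larger patches and lower keeping rates are effective levers for matching a compute budget.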
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer
