LightViT: Towards Light-Weight Convolution-Free Vision Transformers
Tao Huang, Lang Huang, Shan You, Fei Wang, Chen Qian, Chang Xu

TL;DR
LightViT is a family of convolution-free, lightweight vision transformers. It adds learnable global tokens to both self-attention and the feed-forward network so that information is aggregated across all tokens, giving a better accuracy-efficiency balance without convolutions.
Contribution
The paper proposes a novel global aggregation scheme for pure transformer blocks, eliminating the need for convolutions in lightweight vision transformers.
Findings
LightViT-T achieves 78.7% accuracy on ImageNet with 0.7G FLOPs.
Outperforms PVTv2-B0 by 8.2% in ImageNet accuracy.
Runs 11% faster on GPU than PVTv2-B0.
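The 8.2% gap reported in the findings can be read two ways: as absolute percentage points (the usual convention in ImageNet comparisons) or as a relative improvement. The short calculation below shows the baseline accuracy each reading would imply. It is only arithmetic on the two figures quoted above. The variable names are illustrative, and neither implied baseline is a number taken from this page.

```python
# Figures quoted in the findings above.
lightvit_acc = 78.7   # ImageNet accuracy, in percent
gap = 8.2             # reported gap over PVTv2-B0

# Reading 1 (assumption): the gap is in absolute percentage points.
baseline_if_points = lightvit_acc - gap

# Reading 2 (assumption): the gap is a relative improvement over the baseline.
baseline_if_relative = lightvit_acc / (1 + gap / 100)

print(f"absolute-points reading: baseline = {baseline_if_points:.1f}%")
print(f"relative reading:        baseline = {baseline_if_relative:.1f}%")
```

The two readings differ by roughly two points, so it matters which one a comparison table uses.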
Abstract
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that the convolutional kernels perform information aggregation to connect all tokens; however, they would be actually unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT as a new family of light-weight ViTs to achieve better accuracy-efficiency balance upon the pure transformer blocks without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both self-attention and feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies;…
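To make the idea of learnable global tokens concrete, here is a minimal NumPy sketch of one way such tokens could carry global information through an attention layer. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the query/key/value projections, multiple heads, and any local attention are omitted, and the two-step aggregate-then-broadcast structure with a residual add is an assumption rather than something the abstract specifies.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Plain scaled dot-product attention (single head, no projections).
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v

def global_token_attention(x, g):
    """Hypothetical sketch: x is (N, d) image tokens, g is (T, d) global tokens."""
    # Aggregate: each global token queries every image token.
    g_new = attend(g, x, x)
    # Broadcast: each image token reads the global summary back
    # (the residual add is an assumption of this sketch).
    x_new = x + attend(x, g_new, g_new)
    return x_new, g_new

rng = np.random.default_rng(0)
x = rng.standard_normal((196, 64))  # e.g. a 14x14 grid of patch tokens
g = rng.standard_normal((8, 64))    # stand-in for learned tokens; random here
x_new, g_new = global_token_attention(x, g)
print(x_new.shape, g_new.shape)
```

With T global tokens and N image tokens, both attention steps cost on the order of N·T·d operations rather than the N²·d of full self-attention, which is the appeal of routing global information through a small token set.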
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Visual Attention and Saliency Detection
