LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Tao Huang; Lang Huang; Shan You; Fei Wang; Chen Qian; Chang Xu

arXiv:2207.05557 · cs.CV · July 13, 2022 · 42 cites

TL;DR

LightViT introduces a convolution-free, lightweight vision transformer that employs a global aggregation scheme with learnable tokens and attention mechanisms, achieving high accuracy with low computational cost.

Contribution

The paper proposes a novel global aggregation scheme for pure transformer blocks, eliminating the need for convolutions in lightweight vision transformers.

Findings

1. LightViT-T achieves 78.7% top-1 accuracy on ImageNet with only 0.7G FLOPs.

2. It outperforms PVTv2-B0 by 8.2 percentage points in accuracy.

3. It runs 11% faster on GPU than PVTv2-B0.

Abstract

Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that the convolutional kernels perform information aggregation to connect all tokens; however, they would be actually unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT as a new family of light-weight ViTs to achieve better accuracy-efficiency balance upon the pure transformer blocks without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both self-attention and feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies;…
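The aggregation idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository listed under Code & Models for that), and the abstract is truncated here, so the details are assumptions: a small set of global tokens first gathers information from all image tokens through attention, and the image tokens then read that summary back. The function names `attend` and `global_token_aggregation`, the single-head attention without projection matrices, and the token counts are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    scale = queries.shape[-1] ** -0.5
    weights = softmax(queries @ keys.T * scale, axis=-1)
    return weights @ values


def global_token_aggregation(image_tokens, global_tokens):
    """Hypothetical two-step scheme: aggregate into global tokens, then broadcast back."""
    # Aggregate: each global token attends over all N image tokens (cost ~ T x N).
    gathered = attend(global_tokens, image_tokens, image_tokens)
    # Broadcast: each image token attends over the T global tokens (cost ~ N x T).
    broadcast = attend(image_tokens, gathered, gathered)
    # Residual connection keeps the original local features.
    return image_tokens + broadcast, gathered


rng = np.random.default_rng(0)
num_image_tokens, num_global_tokens, dim = 196, 8, 64  # illustrative sizes only
image_tokens = rng.standard_normal((num_image_tokens, dim))
# In a trained model these would be learnable parameters; random values stand in here.
global_tokens = rng.standard_normal((num_global_tokens, dim))

updated, gathered = global_token_aggregation(image_tokens, global_tokens)
print(updated.shape, gathered.shape)
```

With T global tokens and N image tokens, both attention steps scale with N × T rather than N × N, which is why a handful of extra tokens can supply global context cheaply when T is much smaller than N.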

Code & Models

Repositories

hunto/lightvit (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Visual Attention and Saliency Detection