Rethinking Local Perception in Lightweight Vision Transformer
Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He

TL;DR
CloFormer is a lightweight vision transformer that captures both local and global features by pairing context-aware local enhancement with attention, improving results on image classification, object detection, and semantic segmentation.
Contribution
The paper introduces AttnConv, an attention-style convolution operator, and shows that combining local and global information improves lightweight vision transformers.
Findings
CloFormer outperforms existing lightweight models in image classification.
It also achieves stronger results than those models in object detection and semantic segmentation.
The model reduces FLOPs while maintaining high accuracy.
Abstract
Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, resizing them to a mobile-friendly size leads to significant performance degradation. Therefore, developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention, then proposes an effective and straightforward module to capture high-frequency local information. In CloFormer, we introduce AttnConv, a convolution operator in attention's style. The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features.…
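The contrast the abstract draws between globally shared convolution weights and token-specific, context-aware weights can be illustrated with a small sketch. This is not the authors' implementation: it works on a 1D, single-channel token sequence in plain Python, and the names `conv1d_same` and `attn_conv_1d`, the reuse of one kernel for `q`, `k`, and `v`, and the `tanh(q * k)` gate are all assumptions made for illustration. The abstract does not specify how the context-aware weights are computed.

```python
import math

def conv1d_same(x, kernel):
    """Slide one shared kernel over x with zero padding; output has len(x)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):  # positions outside x count as zero
                acc += w * x[idx]
        out.append(acc)
    return out

def attn_conv_1d(q, k, v, shared_kernel):
    """Hypothetical AttnConv-like step on a 1D token sequence.

    q, k, v: lists of floats, one scalar per token (assumed precomputed).
    shared_kernel: the same weights are applied at every position.
    """
    # Step 1: shared weights aggregate local information around each token.
    local_q = conv1d_same(q, shared_kernel)
    local_k = conv1d_same(k, shared_kernel)
    local_v = conv1d_same(v, shared_kernel)
    # Step 2: token-specific, context-aware weights in [-1, 1].
    # Assumption: a tanh over the product of local q and k stands in for
    # the paper's "carefully designed" weight generator.
    gate = [math.tanh(a * b) for a, b in zip(local_q, local_k)]
    # Step 3: the context-aware weights modulate the aggregated features.
    return [g * val for g, val in zip(gate, local_v)]

if __name__ == "__main__":
    q = [1.0, 0.0, -1.0]
    k = [1.0, 1.0, 1.0]
    v = [2.0, 2.0, 2.0]
    identity = [0.0, 1.0, 0.0]  # kernel that leaves each token unchanged
    print(attn_conv_1d(q, k, v, identity))
```

With the identity kernel, every token sees the same aggregated value, yet the outputs differ per token because the gate depends on each token's own context. That per-token variation is what a convolution with only shared weights cannot produce.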
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Convolution · Dense Connections · Linear Layer · Layer Normalization · Softmax · Residual Connection · Vision Transformer
