Rethinking Local Perception in Lightweight Vision Transformer

Qihang Fan; Huaibo Huang; Jiyang Guan; Ran He

arXiv:2303.17803·cs.CV·June 2, 2023·32 cites

TL;DR

CloFormer is a lightweight vision transformer that effectively captures local and global features by combining context-aware local enhancement with attention mechanisms, improving performance across vision tasks.

Contribution

The paper introduces AttnConv, a convolution operator designed in the style of attention, and shows that combining local and global information improves lightweight vision transformers.

Findings

1. CloFormer outperforms existing lightweight models in image classification.
2. CloFormer achieves superior results in object detection and semantic segmentation.
3. The model reduces FLOPs while maintaining high accuracy.

Abstract

Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, resizing them to a mobile-friendly size leads to significant performance degradation. Therefore, developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention, then proposes an effective and straightforward module to capture high-frequency local information. In CloFormer, we introduce AttnConv, a convolution operator in attention's style. The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features.…
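The two ingredients the abstract names, shared weights that aggregate local information and context-aware weights that enhance it, can be illustrated with a toy example. The sketch below is not the authors' implementation: it works on a 1D sequence of scalar tokens, and the function names (`shared_conv1d`, `attn_conv_sketch`), the tanh gate, and the use of a q*k product to form the gate are all assumptions made for illustration. For the real module, see the official repository.

```python
import math


def shared_conv1d(x, kernel):
    """Aggregate neighbours with ONE kernel shared by every position.

    Zero padding keeps the output the same length as the input.
    The kernel length is assumed to be odd.
    """
    radius = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            p = i + j - radius  # input position covered by kernel tap j
            if 0 <= p < len(x):
                acc += w * x[p]
        out.append(acc)
    return out


def attn_conv_sketch(q, k, v, kernel):
    """Hypothetical attention-style convolution on scalar tokens."""
    # Step 1: globally shared weights aggregate local information.
    q_loc = shared_conv1d(q, kernel)
    k_loc = shared_conv1d(k, kernel)
    v_loc = shared_conv1d(v, kernel)
    # Step 2: a token-specific (context-aware) weight per position.
    # Squashing q*k with tanh into [-1, 1] is an assumption of this sketch.
    gate = [math.tanh(a * b) for a, b in zip(q_loc, k_loc)]
    # Step 3: the context-aware weights modulate the local features.
    return [g * val for g, val in zip(gate, v_loc)]


if __name__ == "__main__":
    q = [0.0, 1.0, 2.0, 1.0]
    k = [1.0, 1.0, 0.5, 0.0]
    v = [2.0, 2.0, 2.0, 2.0]
    print(attn_conv_sketch(q, k, v, kernel=[0.25, 0.5, 0.25]))
```

The point of the contrast: the kernel in `shared_conv1d` is identical at every position, while `gate` differs per token because it is computed from that token's own neighbourhood.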

Code & Models

Repositories

qhfan/CloFormer (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors

Methods: Multi-Head Attention · Attention Is All You Need · Convolution · Dense Connections · Linear Layer · Layer Normalization · Softmax · Residual Connection · Vision Transformer