DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Zhenzhen Chu; Jiayu Chen; Cen Chen; Chengyu Wang; Ziheng Wu; Jun Huang; Weining Qian

arXiv:2309.12424·cs.CV·September 25, 2023·1 cites

TL;DR

DualToken-ViT is a lightweight, efficient vision transformer that combines local convolutional features with global self-attention, enhanced by position-aware global tokens, achieving high accuracy with low computational cost across vision tasks.

Contribution

This paper introduces DualToken-ViT, a novel model that fuses local and global information using position-aware tokens, improving efficiency and performance over existing ViTs.

Findings

01

Achieves 75.4% and 79.4% accuracy on ImageNet-1K with 0.5G and 1.0G FLOPs, respectively.

02

Outperforms LightViT-T by 0.7% accuracy at 1.0G FLOPs.

03

Effective across image classification, detection, and segmentation tasks.

Abstract

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various structures of ViTs, ViTs are increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of inductive biases of locality and translation equivariance demands larger model sizes compared to CNNs to effectively learn visual features. In this paper, we propose a light-weight and efficient vision transformer model called DualToken-ViT that leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the token with local information obtained by convolution-based structure and the token with global information obtained by self-attention-based…
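
The quadratic cost mentioned in the abstract can be illustrated with a rough count. The sketch below is not the paper's FLOPs accounting: it counts only the two matrix products of a single attention layer (Q·Kᵀ and attention·V) and ignores projections, softmax, and multi-head splitting. The helper names `num_tokens` and `attention_macs`, the patch size of 16, and the channel width of 64 are assumptions chosen for illustration.

```python
def num_tokens(image_side: int, patch_side: int) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    per_side = image_side // patch_side
    return per_side * per_side


def attention_macs(tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus attn @ V in one attention layer.

    Each product costs tokens * tokens * dim, so the total is quadratic
    in the number of tokens. Projections and softmax are ignored here.
    """
    return 2 * tokens * tokens * dim


if __name__ == "__main__":
    dim = 64    # illustrative channel width (assumption, not from the paper)
    patch = 16  # illustrative patch size (assumption, not from the paper)
    for side in (224, 448):
        n = num_tokens(side, patch)
        print(f"side={side} tokens={n} attention MACs={attention_macs(n, dim)}")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the attention cost by 16, which is the kind of growth that motivates restricting or approximating global attention in lightweight models.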

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer