DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion
Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian

TL;DR
DualToken-ViT is a lightweight, efficient vision transformer that combines local convolutional features with global self-attention, enhanced by position-aware global tokens, achieving high accuracy with low computational cost across vision tasks.
Contribution
This paper introduces DualToken-ViT, a novel model that fuses local and global information using position-aware tokens, improving efficiency and performance over existing ViTs.
Findings
Achieves 75.4% and 79.4% accuracy on ImageNet-1K with 0.5G and 1.0G FLOPs, respectively.
The 1.0G-FLOPs model outperforms LightViT-T by 0.7%.
Effective across image classification, detection, and segmentation tasks.
Abstract
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various structures of ViTs, ViTs are increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of inductive biases of locality and translation equivariance demands larger model sizes compared to CNNs to effectively learn visual features. In this paper, we propose a light-weight and efficient vision transformer model called DualToken-ViT that leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the token with local information obtained by convolution-based structure and the token with global information obtained by self-attention-based…
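To make the "quadratic complexity" point in the abstract concrete, the following is a minimal arithmetic sketch, not code from the paper. It assumes a plain ViT-style tokenization in which a square image is cut into non-overlapping square patches; the patch size of 16 and the image sizes 224 and 448 are illustrative assumptions, and the helper names are hypothetical.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image cut into square patches.

    Assumption: non-overlapping patches, image side divisible by patch side.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def attention_matrix_entries(n_tokens: int) -> int:
    """Entries in the token-by-token attention matrix of one head."""
    return n_tokens * n_tokens


# Illustrative sizes only (not taken from the paper).
for size in (224, 448):
    n = num_tokens(size, 16)
    print(f"{size}x{size} image: {n} tokens, "
          f"{attention_matrix_entries(n)} attention entries per head")
```

Doubling the image side quadruples the token count and multiplies the attention matrix size by 16, which is the cost that lightweight designs such as the one described here try to avoid paying at full resolution.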
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer
