AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer
Jiquan Shan, Junxiao Wang, Lifeng Zhao, Liang Cai, Hongyuan Zhang, Ioannis Liritzis

TL;DR
AnchorFormer introduces differentiable anchor tokens that approximate global self-attention in vision transformers, cutting attention cost from quadratic in the number of patches n to O(mn) for m < n anchors, and reports accuracy gains on image classification and object detection.
Contribution
The paper proposes a novel differentiable anchor-based attention mechanism that reduces complexity and accelerates inference in vision transformers, applicable to multiple vision tasks.
Findings
Achieves up to 9.0% higher accuracy on ImageNet classification.
Reduces FLOPs by 46.7% compared to baselines.
Improves mAP by 81.3% on COCO detection.
Abstract
Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given $n$ patches, they will have quadratic complexity such as $\mathcal{O}(n^2)$, and the time cost is high when splitting the input image with a small granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, so some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs the anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity will be reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, where $m$ is an anchor number and $m < n$. Notably, by representing the…
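The bipartite attention idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name anchor_attention, the fixed random anchors (the paper describes learning them as neurons of a layer), and the two-softmax factorization are assumptions made for illustration. The sketch only shows why routing attention through m anchors costs O(mn) instead of O(n^2): the n x n attention matrix is never materialized.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchor_attention(q, k, v, anchors):
    """Hypothetical anchor attention sketch.

    q, k, v: (n, d) token projections; anchors: (m, d) with m < n.
    Returns an (n, d) array.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    # Each token distributes its attention over the m anchors: (n, m).
    token_to_anchor = softmax(q @ anchors.T * scale, axis=-1)
    # Each anchor distributes its attention over the n tokens: (m, n).
    anchor_to_token = softmax(anchors @ k.T * scale, axis=-1)
    # Aggregate values at the anchors first, so the cost is O(m * n * d).
    anchor_summary = anchor_to_token @ v           # (m, d)
    return token_to_anchor @ anchor_summary        # (n, d)

# Example sizes are assumptions: 196 patches, 16 anchors, 64 channels.
rng = np.random.default_rng(0)
n, m, d = 196, 16, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
anchors = rng.standard_normal((m, d))
out = anchor_attention(q, k, v, anchors)
```

Because both factors are row-stochastic, their product is a valid n x n attention matrix (each row sums to 1), which can be read as a two-step token-to-anchor-to-token random walk. Computing the product explicitly would bring back the O(n^2) cost, so the sketch multiplies by v from right to left instead.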
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Neural Network Applications
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Dense Connections · Vision Transformer
