AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer

Jiquan Shan; Junxiao Wang; Lifeng Zhao; Liang Cai; Hongyuan Zhang; Ioannis Liritzis

arXiv:2505.16463 · cs.CV · June 23, 2025


TL;DR

AnchorFormer introduces differentiable anchor tokens to approximate global self-attention in vision transformers, reducing attention complexity from $O(n^{2})$ to $O(mn)$ for $n$ patches and $m < n$ anchors, and reports gains on ImageNet classification and COCO detection.

Contribution

The paper proposes a novel differentiable anchor-based attention mechanism that reduces complexity and accelerates inference in vision transformers, applicable to multiple vision tasks.

Findings

01

Achieves up to 9.0% higher accuracy on ImageNet classification.

02

Reduces FLOPs by 46.7% compared to baselines.

03

Improves mAP by 81.3% on COCO detection.

Abstract

Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among image patches. Given $n$ patches, self-attention has quadratic complexity, $O(n^{2})$, so the time cost is high when the input image is split at a fine granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, so some tokens may not be helpful for downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs anchor tokens to learn the pivotal information and accelerate inference. First, by estimating the bipartite attention between the anchors and the tokens, the complexity is reduced from $O(n^{2})$ to $O(mn)$, where $m$ is the number of anchors and $m < n$. Notably, by representing the…
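The complexity claim in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated before it describes how global attention is recovered from the anchors, so the two-step routing below (tokens attend to anchors, anchors attend to tokens) is an assumed, commonly used construction. The function name `anchor_attention`, the shapes, and the random `anchors` matrix (a stand-in for learned anchor parameters) are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def anchor_attention(q, v, anchors):
    """Assumed anchor-routed attention (illustrative only).

    q: (n, d) token queries, v: (n, d) token values,
    anchors: (m, d) anchor vectors with m < n.
    """
    d = q.shape[-1]
    scores = q @ anchors.T / np.sqrt(d)      # (n, m) bipartite token-anchor scores
    tok_to_anc = softmax(scores, axis=1)     # each token's weights over anchors
    anc_to_tok = softmax(scores.T, axis=1)   # each anchor's weights over tokens
    anchor_summary = anc_to_tok @ v          # (m, d): anchors pool token values
    return tok_to_anc @ anchor_summary       # (n, d): tokens read from anchors


# Hypothetical sizes: 196 patches (a 14 x 14 grid), 16 anchors, width 32.
rng = np.random.default_rng(0)
n, m, d = 196, 16, 32
q = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
anchors = rng.standard_normal((m, d))  # stand-in for learned parameters

out = anchor_attention(q, v, anchors)  # out.shape is (196, 32)
```

Every matrix product here involves an $n \times m$ or $m \times n$ factor, so the cost scales as $O(mnd)$ and the $n \times n$ attention matrix is never materialized. The implied global attention, `tok_to_anc @ anc_to_tok`, is still a valid row-stochastic matrix, because it is a product of two row-stochastic matrices.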


Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Neural Network Applications

Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Dense Connections · Vision Transformer