Advancing Vision Transformer with Enhanced Spatial Priors
Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He

TL;DR
This paper introduces EVT, a Vision Transformer that incorporates Euclidean spatial priors and flexible token grouping, reaching 86.6% top-1 accuracy on ImageNet-1k without extra data and improving results on object detection and segmentation.
Contribution
The paper proposes EVT, a novel Vision Transformer that enhances spatial modeling with Euclidean distance decay and flexible attention grouping, surpassing previous methods in accuracy and adaptability.
Findings
EVT achieves 86.6% top-1 accuracy on ImageNet-1k without extra data.
EVT outperforms prior ViT models on object detection and segmentation tasks.
The proposed spatial priors improve the modeling of spatial relationships in vision transformers.
Abstract
In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and suffers from quadratic computational complexity, limiting its applicability. To address these issues, we have proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. RMT utilizes Manhattan distance decay to introduce spatial information and employs a horizontal and vertical decomposition attention method to model global information. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. Firstly, EVT uses a more reasonable Euclidean distance decay to enhance the modeling of spatial information, allowing for a more accurate representation of spatial relationships…
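To make the difference between the two spatial priors concrete, here is a minimal NumPy sketch that builds a distance-decay mask over a grid of image tokens under both metrics. The abstract does not give the formula, so several details are assumptions for illustration: the mask takes the form `gamma ** distance`, the decay base `gamma` is fixed at 0.9, and the mask is multiplied elementwise into the softmax attention weights. The function names `distance_decay_mask` and `decayed_attention` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def distance_decay_mask(height, width, gamma=0.9, metric="euclidean"):
    """Return an (N, N) mask with entries gamma ** d(i, j), where N = height * width.

    Assumption: the decay is a fixed base raised to the pairwise token distance.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]  # (N, N, 2) pairwise offsets

    if metric == "manhattan":
        dist = np.abs(diff).sum(axis=-1)           # |dy| + |dx|
    elif metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # sqrt(dy^2 + dx^2)
    else:
        raise ValueError(f"unknown metric: {metric}")

    return gamma ** dist


def decayed_attention(q, k, v, mask):
    """Single-head attention with the decay mask applied after the softmax.

    Assumption: the mask rescales the softmax weights elementwise.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights * mask) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, dim = 3, 3, 8
    q, k, v = (rng.standard_normal((h * w, dim)) for _ in range(3))
    for metric in ("manhattan", "euclidean"):
        mask = distance_decay_mask(h, w, metric=metric)
        out = decayed_attention(q, k, v, mask)
        # Token 0 is the top-left corner, token 4 is its diagonal neighbour.
        print(metric, "decay to diagonal neighbour:", mask[0, 4], "output shape:", out.shape)
```

In this sketch the two masks agree along rows and columns and differ only for off-axis pairs, where the Manhattan distance exceeds the Euclidean one and so, for `gamma` below 1, attenuates diagonal neighbours more strongly.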
