A 2D Semantic-Aware Position Encoding for Vision Transformers

Xi Chen; Shiyang Zhou; Muqi Huang; Jiaxu Feng; Yun Xiong; Kun Zhou; Biao Yang; Yuhui Zhang; Huishuai Bao; Sijia Peng; Chuan Li; Feng Shi

arXiv:2505.09466·cs.CV·May 15, 2025

TL;DR

This paper introduces SaPE², a semantic-aware 2D position encoding for vision transformers that improves their ability to understand spatial relationships by considering content similarity, leading to better generalization and performance.

Contribution

We propose SaPE², a novel position encoding that dynamically adapts based on local content, addressing limitations of traditional fixed position encodings in vision transformers.

Findings

01. Enhanced model generalization across resolutions and scales
02. Improved translation equivariance in vision transformers
03. Better aggregation of features for similar but distant patches

Abstract

Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding techniques, which are largely borrowed from natural language processing, fail to capture semantic-aware positional relationships between image patches. Traditional approaches such as absolute and relative position encoding focus primarily on 1D linear positional relationships and often neglect the semantic similarity between distant yet contextually related patches. These limitations hinder model generalization, translation equivariance, and the ability to handle repetitive or structured patterns in images. In this paper, we propose 2-Dimensional Semantic-Aware Position Encoding ($SaPE^{2}$), a novel position encoding…
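The gap between 1D and 2D positional relationships mentioned in the abstract can be illustrated with a small sketch. This is not SaPE² and not the authors' code; it only shows how a row-major (raster-order) patch index distorts spatial distance on an image grid. The 14-patch grid width and the helper names `patch_coords`, `dist_1d`, and `dist_2d` are assumptions chosen for illustration.

```python
# Illustrative only: compares distance in a flattened 1D patch sequence
# with distance on the 2D patch grid. Not the SaPE^2 method.

GRID_W = 14  # assumed grid width, e.g. a 224x224 image cut into 16x16 patches


def patch_coords(index, grid_w):
    """Map a row-major patch index to its (row, col) grid position."""
    return divmod(index, grid_w)


def dist_1d(i, j):
    """Distance between two patches in the flattened token sequence."""
    return abs(i - j)


def dist_2d(i, j, grid_w):
    """Manhattan distance between two patches on the 2D grid."""
    (row_i, col_i) = patch_coords(i, grid_w)
    (row_j, col_j) = patch_coords(j, grid_w)
    return abs(row_i - row_j) + abs(col_i - col_j)


# Patches 0 and 14 are vertical neighbours, yet far apart in the sequence.
print(dist_1d(0, 14), dist_2d(0, 14, GRID_W))

# Patches 13 and 14 are adjacent in the sequence, yet sit at opposite
# ends of two different rows on the grid.
print(dist_1d(13, 14), dist_2d(13, 14, GRID_W))
```

Under these assumptions, a purely 1D notion of position treats vertical neighbours as distant and row wrap-arounds as adjacent. Neither distance takes patch content into account, which is the further limitation the abstract attributes to existing encodings.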

Taxonomy

Topics: Robotics and Sensor-Based Localization · Infrared Target Detection Methodologies · Advanced Image and Video Retrieval Techniques

Methods: Focus