Sequence and Circle: Exploring the Relationship Between Patches
Zhengyang Yu, Jochen Triesch

TL;DR
This paper investigates alternative spatial encoding methods for vision transformers, proposing sequence and circle relationship embeddings that can replace or complement learnable position embeddings, reducing parameters while maintaining or improving performance.
Contribution
The paper introduces two spatial encoding schemes, the sequence relationship embedding (SRE) and the circle relationship embedding (CRE), that can replace or complement the learnable position embedding in vision transformers.
Findings
SRE and CRE can replace the learnable position embedding (PE) with similar performance.
Combining SRE or CRE with PE improves accuracy.
Proposed methods reduce learnable parameters in ViT models.
Abstract
The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It utilizes a learnable position embedding (PE) mechanism to encode the location of each image patch. However, it is presently unclear if this learnable PE is really necessary and what its benefits are. This paper explores two alternative ways of encoding the location of individual patches that exploit prior knowledge about their spatial arrangement. One is called the sequence relationship embedding (SRE), and the other is called the circle relationship embedding (CRE). Among them, the SRE considers all patches to be in order, and adjacent patches have the same interval distance. The CRE considers the central patch as the center of the circle and measures the distance of the remaining patches from the center based on the four neighborhoods principle. Multiple concentric circles with different…
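The two spatial priors described in the abstract can be illustrated with a small sketch. It computes only the raw per-patch values (a sequence index and a distance from the central patch), not the embeddings themselves, since the excerpt does not say how the paper maps these values into the model. The function names, the raster (row-major) ordering for SRE, and the reading of the "four neighborhoods principle" as city-block (Manhattan) distance are assumptions made for illustration.

```python
def sequence_indices(grid_h, grid_w):
    """Assumed SRE prior: patches in raster order, adjacent patches one step apart."""
    return [[r * grid_w + c for c in range(grid_w)] for r in range(grid_h)]


def circle_distances(grid_h, grid_w):
    """Assumed CRE prior: 4-neighborhood (Manhattan) distance to the central patch.

    For even grid sizes there is no single central patch; this sketch simply
    uses integer division to pick one, which is an arbitrary choice.
    """
    center_r, center_c = grid_h // 2, grid_w // 2
    return [
        [abs(r - center_r) + abs(c - center_c) for c in range(grid_w)]
        for r in range(grid_h)
    ]


if __name__ == "__main__":
    for row in sequence_indices(3, 3):
        print(row)
    for row in circle_distances(5, 5):
        print(row)
```

On a 5x5 grid, patches with the same distance value form diamond-shaped rings around the center, which is one way to picture the concentric circles the abstract mentions.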
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Residual Connection · Dense Connections · Linear Layer · Vision Transformer
