Rethinking and Improving Relative Position Encoding for Vision Transformer
Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

TL;DR
This paper introduces new relative position encoding methods tailored for vision transformers, demonstrating significant accuracy improvements on image classification and object detection benchmarks without extra hyperparameter tuning.
Contribution
It proposes simple, lightweight 2D image-specific relative position encoding methods that outperform existing approaches and provide new insights into their effectiveness in vision transformers.
Findings
iRPE improves DeiT top-1 accuracy by up to 1.5% on ImageNet.
iRPE improves DETR mAP by up to 1.3% on COCO.
Some findings challenge previous assumptions about position encoding in vision models.
Abstract
Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work as well as absolute position encoding. To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight. They can be easily plugged into transformer blocks. Experiments…
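The two ideas named in the abstract, directional relative distance and query-dependent position terms, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the clipping-based bucketing, and the `max_dist` parameter are assumptions chosen for illustration, and the paper's actual index mapping may differ. The sketch assigns each (query, key) patch pair a bucket from its signed 2D offset, so "left of" and "right of" get different embeddings, and adds the dot product between the query and that bucket's embedding to the attention logits.

```python
import numpy as np

def directional_rel_index(height, width, max_dist):
    """Bucket id for every (query, key) patch pair, from the signed 2D offset.

    Hypothetical helper: offsets are clipped to [-max_dist, max_dist] per axis.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (N, 2) row/col per patch
    rel = coords[None, :, :] - coords[:, None, :]         # (N, N, 2): key minus query
    rel = np.clip(rel, -max_dist, max_dist) + max_dist    # shift into [0, 2*max_dist]
    side = 2 * max_dist + 1
    return rel[..., 0] * side + rel[..., 1]               # (N, N) integer bucket ids

def attention_with_contextual_rpe(q, k, v, rel_index, rel_table):
    """Single-head attention whose logits include a query-dependent position term.

    q, k, v: (N, d); rel_index: (N, N) ints; rel_table: (num_buckets, d).
    """
    d = q.shape[-1]
    content = q @ k.T                                      # (N, N) content scores
    q_dot_table = q @ rel_table.T                          # (N, num_buckets)
    # position[i, j] = q[i] . rel_table[rel_index[i, j]]
    position = np.take_along_axis(q_dot_table, rel_index, axis=1)
    logits = (content + position) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage on a 3x3 grid of patches with random inputs.
rng = np.random.default_rng(0)
H, W, d, max_dist = 3, 3, 8, 2
N = H * W
rel_index = directional_rel_index(H, W, max_dist)
rel_table = rng.normal(size=((2 * max_dist + 1) ** 2, d))
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
out, weights = attention_with_contextual_rpe(q, k, v, rel_index, rel_table)
```

Because the bucket depends on the sign of the offset, the pair (patch 0, patch 1) and the pair (patch 1, patch 0) index different rows of `rel_table`; an undirected distance would map both to the same row.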
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Label Smoothing · Residual Connection · Dense Connections · Softmax
