Rethinking and Improving Relative Position Encoding for Vision Transformer

Kan Wu; Houwen Peng; Minghao Chen; Jianlong Fu; Hongyang Chao

arXiv:2107.14222 · cs.CV · July 30, 2021 · 24 cites

TL;DR

This paper introduces new relative position encoding methods tailored for vision transformers, demonstrating significant accuracy improvements on image classification and object detection benchmarks without extra hyperparameter tuning.

Contribution

It proposes simple, lightweight 2D image-specific relative position encoding methods that outperform existing approaches and provide new insights into their effectiveness in vision transformers.

Findings

1. iRPE improves DeiT top-1 accuracy by up to 1.5% on ImageNet.

2. iRPE improves DETR mAP by up to 1.3% on COCO.

3. Some findings challenge previous assumptions about position encoding in vision models.

Abstract

Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work equally well as absolute position? In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in self-attention mechanism. The proposed iRPE methods are simple and lightweight. They can be easily plugged into transformer blocks. Experiments…
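The two ideas named in the abstract, directional relative distance on a 2D grid and an interaction between queries and relative position embeddings, can be illustrated with a small sketch. This is not the authors' implementation (the official code lives in the repository listed under Code & Models). The function names `relative_bucket`, `bucket_table`, and `contextual_bias`, the per-axis clipping with `max_dist`, and the plain-list data layout are all assumptions made for illustration.

```python
def relative_bucket(dy, dx, max_dist):
    """Map a signed 2D offset (dy, dx) to a bucket id.

    The sign of each axis is kept, so "left of" and "right of" land in
    different buckets (directional). Clipping each axis to
    [-max_dist, max_dist] is an illustrative choice, not the paper's mapping.
    """
    cy = max(-max_dist, min(max_dist, dy)) + max_dist
    cx = max(-max_dist, min(max_dist, dx)) + max_dist
    return cy * (2 * max_dist + 1) + cx


def bucket_table(height, width, max_dist):
    """Bucket id for every (query, key) pair of a height x width patch grid."""
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [
        [relative_bucket(ky - qy, kx - qx, max_dist) for (ky, kx) in coords]
        for (qy, qx) in coords
    ]


def contextual_bias(queries, rel_embed, table):
    """bias[i][j] = dot(q_i, r_b) with b = table[i][j].

    The bias depends on the query content, not only on the offset. It would
    be added to the attention logits before the softmax.
    """
    return [
        [sum(q * r for q, r in zip(queries[i], rel_embed[b])) for b in row]
        for i, row in enumerate(table)
    ]


# Usage on a 2 x 3 grid of patches with offsets clipped to one step per axis.
table = bucket_table(2, 3, 1)        # 6 x 6 table, 9 possible buckets
print(table[0][1], table[1][0])      # 5 3: opposite directions, different buckets
```

Because the bucket id keeps the sign of each axis, a key one patch to the right of a query and a key one patch to the left receive different embeddings, which an undirected distance would not allow.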

Code & Models

Repositories

microsoft/cream
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Label Smoothing · Residual Connection · Dense Connections · Softmax