Rotary Position Embedding for Vision Transformer

Byeongho Heo; Song Park; Dongyoon Han; Sangdoo Yun

arXiv:2403.13298·cs.CV·July 17, 2024·1 cites

TL;DR

This paper investigates the application of Rotary Position Embedding (RoPE) to Vision Transformers, demonstrating its effectiveness in improving performance and extrapolation capabilities across various vision tasks with minimal computational cost.

Contribution

The paper provides the first comprehensive analysis of, and practical guidelines for, applying RoPE to Vision Transformers in computer vision tasks.

Findings

01. RoPE enhances Vision Transformer performance on ImageNet-1k, COCO, and ADE-20k.

02. RoPE maintains high precision with increased image resolution during inference.

03. RoPE offers significant extrapolation benefits with minimal additional computation.

Abstract

Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE into ViT, promising improved backbone performance with minimal extra computational overhead. Our…
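
To make the idea of RoPE on 2D vision data concrete, the following is a minimal NumPy sketch of one common variant, usually called axial RoPE: half of the channel pairs are rotated according to a patch's x coordinate and the other half according to its y coordinate. It is an illustration under stated assumptions, not the authors' implementation. The function names `axial_rope_angles` and `apply_rope`, the frequency schedule, and the value `base=100.0` are all hypothetical choices made for this example.

```python
import numpy as np


def axial_rope_angles(height, width, head_dim, base=100.0):
    """Rotation angles for an H x W patch grid, shape (H*W, head_dim // 2).

    Assumption for this sketch: the first half of the channel pairs use the
    x coordinate and the second half use the y coordinate. `base` is an
    illustrative value, not one taken from the page above.
    """
    if head_dim % 4 != 0:
        raise ValueError("head_dim must be divisible by 4")
    n_freq = head_dim // 4
    # Geometrically decaying frequencies, as in 1D RoPE.
    freqs = base ** (-np.arange(n_freq) / n_freq)  # (n_freq,)
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    xs = xs.reshape(-1, 1).astype(np.float64)  # (H*W, 1)
    ys = ys.reshape(-1, 1).astype(np.float64)  # (H*W, 1)
    return np.concatenate([xs * freqs, ys * freqs], axis=1)


def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x (N, head_dim) by angles (N, head_dim // 2)."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

In this sketch the rotation is applied to queries and keys before the attention dot product, so the resulting score depends only on the relative offset between two patches. Because the angles are computed from coordinates rather than looked up in a learned table, the same function can be called with a larger grid at inference, which is one intuition for the resolution extrapolation described in the abstract.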
