Rotary Position Embedding for Vision Transformer
Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun

TL;DR
This paper investigates the application of Rotary Position Embedding (RoPE) to Vision Transformers, demonstrating its effectiveness in improving performance and extrapolation capabilities across various vision tasks with minimal computational cost.
Contribution
It provides a comprehensive analysis of RoPE applied to Vision Transformers, along with practical guidelines for using it on 2D vision data.
Findings
RoPE enhances Vision Transformer performance on ImageNet-1k, COCO, and ADE-20k.
RoPE maintains precision when image resolution is increased at inference.
RoPE offers significant extrapolation benefits with minimal additional computation.
Abstract
Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE into ViT, promising improved backbone performance with minimal extra computational overhead. Our…
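The abstract refers to RoPE implementations for 2D vision data, and the released checkpoints are named "axial" and "mixed". As a rough illustration only, the sketch below shows one way an axial-style 2D rotation could work: half of the channel pairs are rotated by the patch's x index and the other half by its y index. This is not the authors' code. The function names `axial_rope_2d` and `attention_score`, the `base=100.0` frequency base, and the channel layout are all assumptions made for the example.

```python
import cmath


def axial_rope_2d(vec, x, y, base=100.0):
    """Rotate a feature vector by angles derived from a patch position (x, y).

    Hypothetical helper for illustration. `vec` is a list of floats whose
    length is divisible by 4. Consecutive channels are treated as
    (real, imag) pairs; the first half of the pairs is rotated by x,
    the second half by y. `base` is an assumed frequency base.
    """
    assert len(vec) % 4 == 0, "need an even number of channel pairs"
    n_pairs = len(vec) // 2
    half = n_pairs // 2
    out = []
    for i in range(n_pairs):
        pos = x if i < half else y          # axis assigned to this pair
        k = i % half                        # frequency index within the axis
        theta = base ** (-k / half)         # geometrically spaced frequencies
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * theta * pos)
        out.extend([z.real, z.imag])
    return out


def attention_score(q, k):
    """Plain dot product between a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(q, k))


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
k = [0.5, -1.0, 2.0, 0.0, -3.0, 1.5, 4.0, 2.5]

# Same relative offset (dx=2, dy=3) at two different absolute locations.
near = attention_score(axial_rope_2d(q, 1, 2), axial_rope_2d(k, 3, 5))
far = attention_score(axial_rope_2d(q, 11, 22), axial_rope_2d(k, 13, 25))
print(abs(near - far) < 1e-9)
```

In this sketch the query-key score depends only on the offset between the two patches, not on their absolute coordinates, which is the relative-position property usually associated with RoPE. That property is plausibly connected to the resolution extrapolation described in the abstract, since positions outside the training grid still yield well-defined rotations, but the sketch does not reproduce the paper's experiments.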
Code & Models
- 🤗 naver-ai/rope_mixed_deit_small_patch16_LS (model) · 8 dl · ♡ 1
- 🤗 naver-ai/rope_mixed_deit_base_patch16_LS (model) · 7 dl
- 🤗 naver-ai/rope_mixed_deit_large_patch16_LS (model) · 5 dl
- 🤗 naver-ai/rope_axial_deit_small_patch16_LS (model) · 5 dl
- 🤗 naver-ai/rope_axial_deit_base_patch16_LS (model) · 7 dl
- 🤗 naver-ai/rope_axial_deit_large_patch16_LS (model) · 11 dl
- 🤗 naver-ai/rope_mixed_ape_deit_small_patch16_LS (model) · 5 dl
- 🤗 naver-ai/rope_mixed_ape_deit_base_patch16_LS (model) · 6 dl
- 🤗 naver-ai/rope_mixed_ape_deit_large_patch16_LS (model) · 5 dl
- 🤗 naver-ai/rope_axial_ape_deit_small_patch16_LS (model) · 8 dl
Taxonomy
Topics: Infrared Target Detection Methodologies · Advanced Measurement and Detection Methods · Optical Systems and Laser Technology
Methods: Attention Is All You Need · Dropout · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Adam · Transformer
