VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Xilin Wei; Xiaoran Liu; Yuhang Zang; Xiaoyi Dong; Pan Zhang; Yuhang Cao; Jian Tong; Haodong Duan; Qipeng Guo; Jiaqi Wang; Xipeng Qiu; Dahua Lin

arXiv:2502.05173·cs.CV·June 2, 2025

Open Access · 1 Repo · 4 Models · 2 Datasets · 1 Video

TL;DR

This paper introduces VideoRoPE, a novel spatio-temporal position embedding for videos that addresses limitations of previous methods, improving performance on various video understanding tasks.

Contribution

The paper proposes VideoRoPE, a rotary position embedding with a 3D structure designed to preserve spatio-temporal relationships when RoPE is adapted to video, addressing challenges left open by prior variants.

Findings

01. VideoRoPE outperforms previous RoPE variants on multiple tasks.
02. The V-NIAH-D task reveals limitations of prior RoPE variants.
03. VideoRoPE effectively preserves spatio-temporal relationships in videos.

Abstract

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features…
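For readers unfamiliar with the 1D RoPE that the abstract takes as its starting point, the following is a minimal sketch of the standard 1D formulation only. It is not VideoRoPE and does not attempt the paper's 3D structure, which the truncated abstract does not specify. The function names `rope_rotate` and `dot`, the base of 10000, and the toy vectors are illustrative assumptions. The sketch shows the property that makes RoPE attractive: the attention score between a rotated query and key depends only on their position offset.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply standard 1D RoPE to an even-length vector (hypothetical helper)."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(dim // 2):
        # Pair i rotates by position * base^(-2i/dim):
        # early pairs are high-frequency, late pairs are low-frequency.
        theta = position * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (arbitrary values chosen for illustration).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset of 3 at different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 103), rope_rotate(k, 100))
# A different offset, for contrast.
score_c = dot(rope_rotate(q, 5), rope_rotate(k, 4))
```

Here `score_a` and `score_b` agree up to floating-point error because both pairs are 3 positions apart. How the rotated dimension pairs should be divided among temporal and spatial axes for video is the question the paper addresses, and this sketch takes no position on it.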

Code & Models

Repositories

wiselnn570/videorope · PyTorch · Official

Videos

VideoRoPE: What Makes for Good Video Rotary Position Embedding? · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Image and Video Stabilization · Hand Gesture Recognition Systems