VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

TL;DR
This paper introduces VideoRoPE, a rotary position embedding for video with a 3D structure designed to preserve spatio-temporal relationships. The authors report that it outperforms previous RoPE variants on multiple video tasks.
Contribution
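The paper identifies four key characteristics needed to adapt Rotary Position Embedding (RoPE) to video, introduces the V-NIAH-D task (V-NIAH with periodic distractors), and proposes VideoRoPE, a RoPE variant with a 3D structure designed to preserve spatio-temporal relationships.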
Findings
VideoRoPE outperforms previous RoPE variants on multiple tasks.
The V-NIAH-D task shows that prior RoPE variants, which lack appropriate temporal dimension allocation, are easily misled by distractors.
VideoRoPE effectively preserves spatio-temporal relationships in videos.
Abstract
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features…
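For readers unfamiliar with the 1D RoPE that the abstract takes as its starting point, the following is a minimal sketch of the standard 1D formulation. It is not VideoRoPE and is not taken from the paper's code. The function names `rope_1d` and `dot`, the base of 10000, and the example vectors are illustrative assumptions.

```python
import math

def rope_1d(vec, position, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles.

    Standard 1D RoPE sketch (assumed formulation, not the paper's VideoRoPE):
    pair j is rotated by position * base**(-2j / dim).
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # i = 2j, so base ** (-i / dim) is the frequency of pair j
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors, chosen only for illustration.
q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# The score depends only on the position offset (4 in both cases).
score_a = dot(rope_1d(q, 3), rope_1d(k, 7))
score_b = dot(rope_1d(q, 10), rope_1d(k, 14))
```

The two scores agree because rotating the query by position m and the key by position n leaves a dot product that depends only on n - m. The sketch assigns a single position index per token, which is the 1D setting the abstract contrasts with video's spatio-temporal structure.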
Code & Models
- 🤗 Wiselnn/Qwen2-VL-videorope-128frames-8k-context-330k-llava-video (model) · 10 dl
- 🤗 Wiselnn/Qwen2-VL-m_rope-128frames-8k-context-330k-llava-video (model) · 3 dl
- 🤗 Wiselnn/Qwen2-VL-vanilla_rope-128frames-8k-context-330k-llava-video (model) · 5 dl
- 🤗 Wiselnn/Qwen2-VL-tad_rope-128frames-8k-context-330k-llava-video (model) · 3 dl
Taxonomy
Topics: Video Analysis and Summarization · Image and Video Stabilization · Hand Gesture Recognition Systems
