VRoPE: Rotary Position Embedding for Video Large Language Models

Zikang Liu; Longteng Guo; Yepeng Tang; Tongtian Yue; Junxian Cai; Kai Ma; Qingbin Liu; Xi Chen; Jing Liu

arXiv:2502.11664 · cs.AI · November 3, 2025


TL;DR

VRoPE introduces a novel rotary position embedding tailored for Video-LLMs, effectively addressing spatial-temporal encoding challenges and improving video understanding and reasoning tasks.

Contribution

VRoPE is a balanced positional encoding method for Video-LLMs that addresses the limitations of earlier adaptations such as RoPE-3D, namely positional bias in the attention distribution and disrupted video-text transitions.

Findings

01. VRoPE outperforms previous RoPE variants in video understanding tasks.

02. It achieves significant improvements in temporal reasoning.

03. The method ensures a more uniform spatial focus distribution.

Abstract

Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate…
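
As background for the abstract, the following is a minimal sketch of standard 1D RoPE and of a RoPE-3D-style index layout. It is not the VRoPE method, whose details are not given on this page. The function names (`rope_rotate`, `video_indices_3d`) and the per-token `(t, h, w)` layout are illustrative assumptions, the latter being one reading of "encode spatial and temporal dimensions separately". The sketch shows the property that makes RoPE attractive: the attention score between a rotated query and key depends only on their relative offset.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply standard 1D RoPE to an even-length vector at integer position `pos`."""
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates coordinates in pairs"
    out = []
    for i in range(0, d, 2):
        # Pair j = i // 2 uses frequency base^(-2j/d) = base^(-i/d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def video_indices_3d(frames, height, width):
    """Hypothetical RoPE-3D-style indexing: one (t, h, w) triple per video token."""
    return [(t, h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions have the same relative offset (3),
# so the two scores agree up to floating-point error.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# A 2-frame, 2x3-patch clip yields 12 video tokens, each with its own triple.
indices = video_indices_3d(2, 2, 3)
```

Under this assumed layout, text tokens still carry a single scalar index while video tokens carry a triple, which is where a mismatch at the video-text boundary can arise.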


Code & Models

Repositories

johncaged/vrope (Official)

Videos

VRoPE: Rotary Position Embedding for Video Large Language Models · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Softmax · Attention Is All You Need