Geometry-Aware Rotary Position Embedding for Consistent Video World Model

Chendong Xiang; Jiajun Liu; Jintao Zhang; Xiao Yang; Zhengwei Fang; Shizun Wang; Zijun Wang; Yingtian Zou; Hang Su; Jun Zhu

arXiv:2602.07854 · cs.CV · February 24, 2026

TL;DR

This paper introduces ViewRope, a geometry-aware encoding for video transformers that enhances 3D consistency and long-term stability in predictive world models by integrating camera-ray directions into attention mechanisms.

Contribution

The paper proposes ViewRope, a novel geometry-aware positional encoding that improves 3D consistency in video transformers for world modeling, addressing limitations of screen-space embeddings.

Findings

01. Significantly improves long-term scene consistency.
02. Reduces computational costs while maintaining memory fidelity.
03. Enhances loop-closure accuracy in predictive models.

Abstract

Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware…
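The contrast the abstract draws between pixel locality and ray geometry can be illustrated with a minimal sketch. This is not the ViewRope implementation, which the abstract does not specify. It assumes an ideal pinhole camera, a pure yaw rotation, and made-up intrinsics, and every helper name (`pixel_to_camera_ray`, `yaw_rotation`, `world_ray`, `project`) is hypothetical. It shows only that one scene direction lands on different pixels in two views, while its world-space ray direction stays the same.

```python
import math

def pixel_to_camera_ray(u, v, fx, fy, cx, cy):
    """Unit ray through pixel (u, v) in the camera frame (pinhole model)."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)

def yaw_rotation(theta):
    """Camera-to-world rotation for a yaw of theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))

def rotate(R, d):
    """Apply a 3x3 rotation (tuple of rows) to a 3-vector."""
    return tuple(sum(R[i][j] * d[j] for j in range(3)) for i in range(3))

def world_ray(u, v, intr, R):
    """World-space unit ray for pixel (u, v) of a camera with rotation R."""
    return rotate(R, pixel_to_camera_ray(u, v, *intr))

def project(d_world, intr, R):
    """Pixel at which a world direction appears in a camera with rotation R."""
    # The inverse of a rotation matrix is its transpose.
    Rt = tuple(tuple(R[j][i] for j in range(3)) for i in range(3))
    x, y, z = rotate(Rt, d_world)
    fx, fy, cx, cy = intr
    return (fx * x / z + cx, fy * y / z + cy)

# Illustrative intrinsics (fx, fy, cx, cy); assumed values, not from the paper.
intr = (500.0, 500.0, 320.0, 240.0)
R_a = yaw_rotation(0.0)   # first view
R_b = yaw_rotation(0.2)   # camera has turned by 0.2 rad in the second view

# A scene direction seen at the image centre of view A.
ray_a = world_ray(320.0, 240.0, intr, R_a)

# The same direction appears at a different pixel in view B ...
pixel_b = project(ray_a, intr, R_b)

# ... but the world-space ray recovered from that pixel matches ray_a.
ray_b = world_ray(pixel_b[0], pixel_b[1], intr, R_b)

print("pixel in A:", (320.0, 240.0))
print("pixel in B:", pixel_b)
print("ray A:", ray_a)
print("ray B:", ray_b)
```

An encoding keyed to pixel coordinates would treat the two observations as unrelated positions, whereas one keyed to ray direction would see them as the same. That is the intuition behind the abstract's claim, under the simplifying assumptions stated above.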

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging