Geometry-Aware Rotary Position Embedding for Consistent Video World Model
Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu

TL;DR
This paper introduces ViewRope, a geometry-aware encoding for video transformers that enhances 3D consistency and long-term stability in predictive world models by integrating camera-ray directions into attention mechanisms.
Contribution
The paper proposes ViewRope, a novel geometry-aware positional encoding that improves 3D consistency in video transformers for world modeling, addressing limitations of screen-space embeddings.
Findings
Significantly improves long-term scene consistency.
Reduces computational costs while maintaining memory fidelity.
Enhances loop-closure accuracy in predictive models.
Abstract
Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware…
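To make the idea of attention parameterized by relative ray geometry concrete, here is a minimal NumPy sketch. It is not the paper's formulation, which the abstract does not spell out. It simply swaps the pixel index in a standard rotary embedding for angles derived from each token's world-frame camera ray. The function names (ray_directions, ray_angles, rotary_from_rays), the pinhole camera model, the choice of y as the vertical axis, and the base parameter are all assumptions made for illustration.

```python
import numpy as np

def ray_directions(K, R_c2w, height, width):
    """Unit world-frame ray per pixel centre, shape (H*W, 3). Assumes a pinhole camera."""
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays_cam = pix @ np.linalg.inv(K).T   # back-project pixels through the intrinsics
    rays_world = rays_cam @ R_c2w.T       # rotate camera-frame rays into the world frame
    return rays_world / np.linalg.norm(rays_world, axis=-1, keepdims=True)

def ray_angles(rays):
    """Azimuth about the y axis and elevation along y (assumed convention)."""
    azimuth = np.arctan2(rays[:, 0], rays[:, 2])
    elevation = np.arcsin(np.clip(rays[:, 1], -1.0, 1.0))
    return azimuth, elevation

def rotary_from_rays(x, rays, base=100.0):
    """Rotate consecutive feature pairs of x (N, D) by ray-derived angles; D % 4 == 0."""
    n, d = x.shape
    assert d % 4 == 0, "feature dimension must be divisible by 4"
    n_freq = d // 4
    freqs = base ** (-np.arange(n_freq) / n_freq)  # hypothetical frequency schedule
    az, el = ray_angles(rays)
    # First half of the pairs encodes azimuth, second half encodes elevation.
    theta = np.concatenate([az[:, None] * freqs, el[:, None] * freqs], axis=-1)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Usage: attention logits between tokens of two frames with different camera poses.
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
yaw = 0.3
R_b = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                [0.0, 1.0, 0.0],
                [-np.sin(yaw), 0.0, np.cos(yaw)]])
rays_a = ray_directions(K, np.eye(3), 2, 4)
rays_b = ray_directions(K, R_b, 2, 4)
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 8))
k = rng.standard_normal((8, 8))
logits = rotary_from_rays(q, rays_a) @ rotary_from_rays(k, rays_b).T  # shape (8, 8)
```

In this sketch the query-key logit depends only on the difference between the two tokens' ray angles, so two tokens that look along the same world direction score as if no positional rotation had been applied, regardless of where they fall on screen. Azimuth wraps at ±π, which a real design would need to handle; the sketch ignores it.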
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
