TL;DR
URoPE extends rotary position embeddings to cross-view and cross-dimensional geometric spaces, enabling transformers to better handle geometric reasoning in diverse vision tasks.
Contribution
The paper introduces URoPE, a parameter-free, intrinsics-aware relative position embedding that generalizes RoPE to cross-view and 2D-to-3D settings, improving transformer performance across multiple vision tasks.
Findings
URoPE improves transformer performance in view synthesis, 3D detection, and tracking.
It is invariant to the choice of global coordinate system and compatible with existing RoPE kernels.
Experiments demonstrate URoPE's effectiveness across diverse geometric vision tasks.
Abstract
Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of…
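The projection step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`project_ray_samples`, `rope_2d`), the pinhole camera model, the frequency base, and the channel layout are all hypothetical. The excerpt also does not say how the coordinates from the different depth anchors are distributed across heads or channels, so the sketch leaves that choice to the caller and does not handle points that fall behind the query camera.

```python
import numpy as np

def project_ray_samples(uv_kv, K_kv, K_q, T_q_from_kv, depths):
    """Sample 3D points along key/value camera rays and project them
    into the query image plane (assumed pinhole model, no distortion).

    uv_kv:        (N, 2) pixel coordinates of key/value patches
    K_kv, K_q:    (3, 3) intrinsics of the key/value and query cameras
    T_q_from_kv:  (4, 4) rigid transform from key/value to query camera frame
    depths:       (D,)   predefined depth anchors
    returns:      (N, D, 2) projected pixel coordinates in the query image
    """
    depths = np.asarray(depths, dtype=float)
    n = uv_kv.shape[0]
    homog = np.concatenate([uv_kv, np.ones((n, 1))], axis=1)   # (N, 3)
    rays = homog @ np.linalg.inv(K_kv).T                       # directions with z = 1
    pts = rays[:, None, :] * depths[None, :, None]             # (N, D, 3)
    R, t = T_q_from_kv[:3, :3], T_q_from_kv[:3, 3]
    pts_q = pts @ R.T + t                                      # into the query frame
    proj = pts_q @ K_q.T                                       # homogeneous pixels
    return proj[..., :2] / proj[..., 2:3]                      # perspective divide

def rope_2d(x, xy, base=100.0):
    """Standard axial 2D RoPE: the first half of the channels is rotated
    by the x coordinate, the second half by the y coordinate.
    The base and channel split are illustrative assumptions.

    x:  (..., d) features, d divisible by 4
    xy: (..., 2) positions, here possibly fractional pixel coordinates
    """
    d = x.shape[-1]
    assert d % 4 == 0, "feature dimension must be divisible by 4"
    half, quarter = d // 2, d // 4
    freqs = base ** (-np.arange(quarter) / quarter)            # (d/4,)
    out = []
    for axis in range(2):
        ang = xy[..., axis:axis + 1] * freqs                   # (..., d/4)
        a = x[..., axis * half: axis * half + quarter]
        b = x[..., axis * half + quarter: (axis + 1) * half]
        out += [a * np.cos(ang) - b * np.sin(ang),
                a * np.sin(ang) + b * np.cos(ang)]
    return np.concatenate(out, axis=-1)

# Usage: two key/value patches, three depth anchors, a small sideways shift.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1                                                  # hypothetical baseline
uv = np.array([[10.0, 20.0], [5.0, 7.0]])
coords = project_ray_samples(uv, K, K, T, [1.0, 2.0, 4.0])     # (2, 3, 2)
keys = np.random.default_rng(0).normal(size=(2, 8))
keys_rot = rope_2d(keys, coords[:, 0, :])                      # first depth anchor only
```

Because the positions fed to `rope_2d` are query-image pixel coordinates, the rotation itself is unchanged from ordinary 2D RoPE, which is consistent with the summary's claim of compatibility with existing RoPE kernels.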
