Cameras as Relative Positional Encoding
Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa

TL;DR
This paper introduces PRoPE, a relative positional encoding for transformers in multi-view vision tasks, and reports improved performance in novel view synthesis, depth estimation, and spatial cognition across various settings.
Contribution
The work proposes PRoPE (Projective Positional Encoding), a camera-relative positional encoding that captures full camera frustums, both intrinsics and extrinsics, and improves transformer performance in multi-view 3D perception tasks.
Findings
Relative camera conditioning improves feedforward novel view synthesis.
PRoPE yields further gains on top of relative camera conditioning.
Benefits extend to multiple tasks and model sizes.
Abstract
Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level…
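To make "relative" concrete, here is a minimal NumPy sketch of one way to form a camera-to-camera transform that includes both intrinsics and extrinsics. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`lift_intrinsics`, `projection`, `relative_transform`, `random_rigid`) are hypothetical, the 4x4 lifting of the intrinsics matrix is an assumed convention, and how PRoPE injects such a transform into attention is not covered here. The sketch only checks the property that motivates relative encodings: the transform between two cameras does not change when the world coordinate frame is re-chosen.

```python
import numpy as np

def lift_intrinsics(K):
    """Embed a 3x3 intrinsics matrix in a 4x4 matrix (assumed convention)."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4

def projection(K, world_to_cam):
    """4x4 projective matrix combining intrinsics and extrinsics."""
    return lift_intrinsics(K) @ world_to_cam

def relative_transform(P_i, P_j):
    """Map from camera j's projective frame to camera i's."""
    return P_i @ np.linalg.inv(P_j)

def random_rigid(rng):
    """Random 4x4 rigid transform (rotation + translation) for the demo."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1  # flip one axis so Q is a proper rotation
    T = np.eye(4)
    T[:3, :3] = Q
    T[:3, 3] = rng.normal(size=3)
    return T

rng = np.random.default_rng(0)

# Two cameras with different (made-up, normalized) intrinsics.
K_a = np.array([[1.2, 0.0, 0.5], [0.0, 1.2, 0.5], [0.0, 0.0, 1.0]])
K_b = np.array([[0.9, 0.0, 0.5], [0.0, 0.9, 0.5], [0.0, 0.0, 1.0]])
T_a, T_b = random_rigid(rng), random_rigid(rng)  # world-to-camera poses

rel = relative_transform(projection(K_a, T_a), projection(K_b, T_b))

# Re-express the same scene in a different world frame G.
G_inv = np.linalg.inv(random_rigid(rng))
rel_shifted = relative_transform(
    projection(K_a, T_a @ G_inv), projection(K_b, T_b @ G_inv)
)

# The relative transform is unaffected by the choice of world frame.
invariant = np.allclose(rel, rel_shifted)
print("invariant to world frame:", invariant)
```

The absolute matrices change with the world frame, while the pairwise product cancels it, which is why a relative encoding does not depend on an arbitrary global coordinate system.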
Taxonomy
Topics: Photography and Visual Culture
