RUST: Latent Neural Scene Representations from Unposed Imagery
Mehdi S. M. Sajjadi, Aravindh Mahendran, Thomas Kipf, Etienne Pot, Daniel Duckworth, Mario Lucic, Klaus Greff

TL;DR
RUST introduces a pose-free neural scene representation method trained solely on RGB images, enabling effective novel view synthesis and explicit pose estimation without requiring ground truth camera poses.
Contribution
It proposes RUST, a novel approach that learns latent scene and pose representations from unposed images, reducing reliance on accurate camera pose data for neural scene modeling.
Findings
RUST achieves comparable quality to pose-dependent methods in view synthesis.
The learned latent pose structure allows meaningful camera transformations.
RUST enables large-scale training of neural scene representations without pose supervision.
Abstract
Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent…
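The data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: random linear maps stand in for the transformer components, and every name and dimension here (`ToyPoseFreeModel`, `image_dim`, `scene_dim`, `pose_dim`) is hypothetical. The sketch assumes that the Pose Encoder sees only the target image and that its output is a small latent vector handed to the decoder in place of an explicit camera pose.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained network (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ToyPoseFreeModel:
    """Toy stand-in for a pose-free view synthesis model; all sizes are hypothetical."""

    def __init__(self, image_dim=12, scene_dim=6, pose_dim=3, seed=0):
        rng = random.Random(seed)
        self.scene_encoder = linear(image_dim, scene_dim, rng)
        # Pose encoder maps the target image to a small latent vector.
        self.pose_encoder = linear(image_dim, pose_dim, rng)
        self.decoder = linear(scene_dim + pose_dim, image_dim, rng)

    def encode_scene(self, input_views):
        # Average per-view features into one latent scene representation.
        feats = [apply(self.scene_encoder, view) for view in input_views]
        return [sum(col) / len(feats) for col in zip(*feats)]

    def encode_pose(self, target_image):
        # "Peeks" at the target image; no camera pose is supplied anywhere.
        return apply(self.pose_encoder, target_image)

    def render(self, scene, latent_pose):
        # Decoder is conditioned on the scene latent and the latent pose.
        return apply(self.decoder, scene + latent_pose)

    def training_loss(self, input_views, target_image):
        scene = self.encode_scene(input_views)
        latent_pose = self.encode_pose(target_image)
        prediction = self.render(scene, latent_pose)
        # Mean squared reconstruction error against the target RGB values.
        errors = [(p - t) ** 2 for p, t in zip(prediction, target_image)]
        return sum(errors) / len(errors)


# Usage: three flattened "input views" and one "target view" of random values.
rng = random.Random(1)
views = [[rng.random() for _ in range(12)] for _ in range(3)]
target = [rng.random() for _ in range(12)]
model = ToyPoseFreeModel()
loss = model.training_loss(views, target)
```

In this sketch the only supervision signal is the reconstruction error on the target image, which matches the RGB-only training described in the abstract. Keeping `pose_dim` much smaller than `image_dim` is a design assumption of the sketch, so that the latent can carry little more than viewpoint-like information.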
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Adam · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings
