Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations
Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, and Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel, Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and Andrea Tagliasacchi

TL;DR
The paper introduces Scene Representation Transformer (SRT), a fast, geometry-free method for novel view synthesis from RGB images using set-latent scene representations, outperforming existing methods in speed and quality.
Contribution
SRT generalizes Vision Transformers to sets of images, enabling efficient, geometry-free 3D scene understanding and novel view synthesis in a single feed-forward pass.
Findings
Outperforms recent baselines in PSNR and speed on synthetic datasets
Supports interactive visualization and semantic segmentation of real-world scenes
Operates effectively with posed or unposed RGB images
Abstract
A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Computer Graphics and Visualization Techniques
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Byte Pair Encoding · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax
