Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations

Mehdi S. M. Sajjadi; Henning Meyer; Etienne Pot; Urs Bergmann; Klaus Greff; Noha Radwan; Suhani Vora; Mario Lucic; Daniel Duckworth; Alexey Dosovitskiy; Jakob Uszkoreit; Thomas Funkhouser; Andrea Tagliasacchi

arXiv:2111.13152 · cs.CV · March 30, 2022

TL;DR

The paper introduces the Scene Representation Transformer (SRT), a geometry-free method that synthesizes novel views from posed or unposed RGB images in a single feed-forward pass, using a set-latent scene representation. It outperforms recent baselines in PSNR and speed on synthetic datasets.

Contribution

SRT generalizes Vision Transformers to sets of images, enabling efficient, geometry-free 3D scene understanding and novel view synthesis in a single feed-forward pass.

Findings

01. Outperforms recent baselines in PSNR and speed on synthetic datasets
02. Supports interactive visualization and semantic segmentation of real-world scenes
03. Operates effectively with posed or unposed RGB images

Abstract

A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by…
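The idea of a set-latent scene representation queried by a decoder can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes that each input view contributes a handful of patch tokens, that the tokens from all views are pooled into one unordered set, and that a decoder renders a pixel by letting a per-ray query vector attend over that set. The names `flatten_views`, `cross_attend`, and `ray_query` are hypothetical, the token values are random placeholders, and learned projections, self-attention in the encoder, and the final colour prediction are all omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_views(views):
    """Pool the patch tokens of every view into one flat set (assumed encoder output)."""
    return [token for view in views for token in view]


def cross_attend(query, latent_set):
    """Single-head dot-product attention of one query over the latent set.

    Keys and values are the latent tokens themselves; a real model would
    apply learned projections first (omitted here for brevity).
    """
    dim = len(query)
    scores = [
        sum(q * z for q, z in zip(query, token)) / math.sqrt(dim)
        for token in latent_set
    ]
    weights = softmax(scores)
    return [
        sum(w * token[i] for w, token in zip(weights, latent_set))
        for i in range(dim)
    ]


rng = random.Random(0)
DIM = 4

# Two toy "views", each with three patch tokens of dimension DIM (random placeholders).
views = [
    [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
    for _ in range(2)
]
latent_set = flatten_views(views)

# Hypothetical query vector standing in for an encoded target ray.
ray_query = [rng.uniform(-1, 1) for _ in range(DIM)]
out = cross_attend(ray_query, latent_set)

# Attention over a set does not depend on token order.
shuffled = latent_set[:]
rng.shuffle(shuffled)
out_shuffled = cross_attend(ray_query, shuffled)
max_diff = max(abs(a - b) for a, b in zip(out, out_shuffled))
```

In this sketch the attended output is unchanged (up to floating-point rounding) when the latent tokens are shuffled, which is the sense in which the representation behaves as a set rather than a grid or an explicit 3D structure.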

Code & Models

Repositories

stelzner/srt (PyTorch)

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Computer Graphics and Visualization Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Byte Pair Encoding · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax