LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

Haian Jin; Hanwen Jiang; Hao Tan; Kai Zhang; Sai Bi; Tianyuan Zhang; Fujun Luan; Noah Snavely; Zexiang Xu

arXiv:2410.17242·cs.CV·April 4, 2025

TL;DR

LVSM introduces two transformer-based models for novel view synthesis that do not rely on 3D inductive biases, achieving state-of-the-art results with improved scalability and generalization from sparse views.

Contribution

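The paper presents LVSM, a fully data-driven transformer approach to novel view synthesis that bypasses the 3D inductive biases used in previous methods. It offers two architectures: an encoder-decoder model that compresses the input views into a fixed number of learned 1D latent tokens, and a decoder-only model that maps input images to novel views with no intermediate scene representation.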

Findings

01

Outperforms previous methods by 1.5 to 3.5 dB PSNR.

02

Achieves state-of-the-art quality across multiple datasets.

03

Operates efficiently with reduced computational resources.
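
To put the PSNR gap in the first finding in perspective, the sketch below computes PSNR from mean squared error and converts a dB gain into the equivalent MSE ratio. It is a generic illustration, not the paper's evaluation code: the function names `psnr` and `mse_ratio_for_gain`, the flat-list image format, and the assumption that pixel values lie in [0, 1] are all illustrative.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flat sequences of pixel values.

    Assumes both sequences have the same length and values in [0, max_val].
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    for gain in (1.5, 3.5):
        print(f"{gain} dB gain -> MSE lower by a factor of {mse_ratio_for_gain(gain):.2f}")
```

By this conversion, a gain of 1.5 dB corresponds to roughly 1.4 times lower mean squared error, and a gain of 3.5 dB to roughly 2.2 times lower.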

Abstract

We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only…
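
The inference-speed contrast between the two architectures can be illustrated with simple token counting. The sketch below is a back-of-the-envelope model, not the authors' implementation. It assumes ViT-style non-overlapping patch tokens, that each transformer processes the concatenation of the token groups named in the comments, and that self-attention cost scales with the square of the sequence length. The image size, patch size, and latent count are hypothetical values chosen only for illustration.

```python
def tokens_per_image(height, width, patch):
    """Number of non-overlapping patch tokens for one image (assumed patching scheme)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


def sequence_lengths(num_inputs, height, width, patch, num_latents):
    """Sequence lengths seen by each model under the assumptions stated above."""
    per_image = tokens_per_image(height, width, patch)
    input_tokens = num_inputs * per_image
    return {
        # Encoder-decoder: inputs are compressed once into a fixed set of latents.
        "encoder_once": input_tokens + num_latents,
        # Each target view is then decoded from the latents alone.
        "encdec_per_target": num_latents + per_image,
        # Decoder-only: every target view attends to all input tokens directly.
        "deconly_per_target": input_tokens + per_image,
    }


def pairwise_attention_cost(seq_len):
    """Relative cost of full self-attention: one score per token pair."""
    return seq_len * seq_len


if __name__ == "__main__":
    # Hypothetical settings: 256x256 images, 8x8 patches, 512 latent tokens.
    for n in (2, 4, 8):
        lens = sequence_lengths(n, 256, 256, 8, 512)
        print(n, lens, pairwise_attention_cost(lens["deconly_per_target"]))
```

Under these hypothetical settings, the encoder-decoder's per-target sequence stays at 1536 tokens regardless of how many input views are used, while the decoder-only sequence grows from 3072 tokens with 2 inputs to 9216 tokens with 8 inputs.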

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 5

Strengths

* The idea of achieving high-quality photorealistic NVS with minimal 3D inductive bias is brave. It is also impressive that LVSM implements this brave idea with a straightforward yet effective pure Transformer-based architecture.
* Experiments on several benchmarks demonstrate the effectiveness of the introduced LVSM.
* The paper is well structured, and it is easy to follow.

Weaknesses

* More discussion of the Scene Representation Transformer (SRT) [Sajjadi et al., CVPR 22] is needed. LVSM seems to be a ‘reimplementation’ of SRT with more recent modules, which significantly limits the novelty of LVSM. The discussions in L141-L146 cannot convince me about the key contribution of LVSM. A more thorough analysis is suggested below.
* The introduction should clearly reveal the similarities and differences between SRT and LVSM. The motivation (minimal 3D inductive bias) and architecture (enco

Reviewer 02 · Rating 8 · Confidence 5

Strengths

The paper is well-motivated and very well-written, though certain technical details could benefit from additional clarity (outlined below). The visual results are striking, as shown on the authors’ website, and I appreciate that the authors provide additional results with limited GPU-hours, making reproduction more feasible for academic labs. Overall, this work is a valuable contribution to view synthesis research.

Weaknesses

1. Related Works. While the paper covers key prior work on 3D representation and few-shot view synthesis, it would benefit from a discussion of generative multi-view methods, especially recent works like Free3D (CVPR 2024, which also uses Plücker embedding to encode camera poses) and EscherNet (CVPR 2024, which also supports inference with a varying number of reference/target views). These methods also do not rely on intermediate 3D representations, treating view synthesis as a sequence-to-sequence problem. Ad

Reviewer 03 · Rating 6 · Confidence 5

Strengths

### S1 -- Good results on an interesting task

- The task of synthesizing novel views from a set of input views is interesting and very challenging. The proposed method seems to work well on both object-level and scene-level datasets.
- Based on the visual results shown in Figures 3 and 4, the proposed deterministic pipeline can also imagine new content that is invisible from the input views.

### S2 -- Simple ideas and careful implementations

- There are two main transformer-based arch

Weaknesses

### W1 -- Significance is not well demonstrated

- The proposed idea is a very specific, minor change to SRT -- basically using a slightly different transformer encoder or decoder to replace the original CNN. Fundamentally, I am not fully convinced that it is even crucial to use a transformer-only architecture rather than CNN-based feature extraction followed by a transformer.
- This small change seems to lead to a large improvement on both object-level and scene-level datasets. However, if we train t

Code & Models

Models

🤗 coast01/LVSM · model · ♡ 1

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging