LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias
Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, Zexiang Xu

TL;DR
LVSM introduces two transformer-based models for novel view synthesis that do not rely on 3D inductive biases, achieving state-of-the-art results with improved scalability and generalization from sparse views.
Contribution
The paper presents LVSM, a fully data-driven transformer approach for view synthesis that bypasses the 3D inductive biases of earlier methods. It offers two architectures: an encoder-decoder model and a decoder-only model.
Findings
Outperforms previous methods by 1.5 to 3.5 dB PSNR.
Achieves state-of-the-art quality across multiple datasets.
Operates efficiently with reduced computational resources.
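To put the reported 1.5 to 3.5 dB range in context, the following minimal sketch shows how PSNR is computed from mean squared error and what a given dB gain implies for the error ratio. The function names and the flat-list pixel format are illustrative assumptions, not taken from the paper.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Assumes pixel values lie in [0, max_value]; the flat-list format is
    purely illustrative.
    """
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10 ** (db_gain / 10)


if __name__ == "__main__":
    for gain in (1.5, 3.5):
        print(f"+{gain} dB PSNR -> MSE lower by a factor of {mse_ratio_for_gain(gain):.2f}")
```

Because PSNR is logarithmic, a gain of 1.5 dB corresponds to roughly 1.4 times lower mean squared error, and 3.5 dB to roughly 2.2 times lower.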
Abstract
We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only…
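The inference-speed remark in the abstract can be illustrated with simple token bookkeeping. The sketch below is not the authors' implementation. It assumes images are split into non-overlapping patches, that the decoder-only model processes all input-view tokens together with the target-view tokens for every rendered view, and that the encoder-decoder model decodes from a fixed set of latent tokens plus the target-view tokens. The image size, patch size, and latent count are hypothetical values chosen only for illustration.

```python
def tokens_per_image(height, width, patch):
    """Number of non-overlapping patch tokens for one image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def decoder_only_seq_len(num_inputs, height, width, patch):
    """Tokens processed per rendered view: every input view plus the target view.

    Assumption: input tokens are re-processed for each target view.
    """
    per_image = tokens_per_image(height, width, patch)
    return num_inputs * per_image + per_image


def encoder_decoder_seq_len(num_latents, height, width, patch):
    """Tokens processed per rendered view: fixed latents plus the target view.

    Assumption: the latents are computed once by the encoder and reused.
    """
    return num_latents + tokens_per_image(height, width, patch)


if __name__ == "__main__":
    H = W = 256      # hypothetical image resolution
    PATCH = 8        # hypothetical patch size
    LATENTS = 512    # hypothetical number of 1D latent tokens
    for n in (2, 4, 8):
        dec = decoder_only_seq_len(n, H, W, PATCH)
        enc = encoder_decoder_seq_len(LATENTS, H, W, PATCH)
        print(f"{n} input views: decoder-only {dec} tokens, encoder-decoder {enc} tokens")
```

Under these assumptions the decoder-only sequence grows linearly with the number of input views, while the encoder-decoder sequence per rendered view stays constant, which is consistent with the abstract's statement that the encoder-decoder model offers faster inference.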
Peer Reviews
Decision·ICLR 2025 Oral
* The idea of achieving high-quality photorealistic NVS with minimal 3D inductive bias is brave. It is also impressive that LVSM implements this brave idea with a straightforward yet effective pure Transformer-based architecture.
* Experiments on several benchmarks demonstrate the effectiveness of the introduced LVSM.
* The paper is well structured, and it is easy to follow.
* More discussion of Scene Representation Transformer (SRT) [Sajjadi et al., CVPR 22] is needed. LVSM seems to be a ‘reimplementation’ of SRT with more recent modules, which significantly limits the novelty of LVSM. The discussions in L141-L146 cannot convince me about the key contribution of LVSM. A more thorough analysis is suggested below.
* The introduction should clearly reveal the similarities and differences between SRT and LVSM. The motivation (minimal 3D inductive bias) and architecture (enco
The paper is well-motivated and very well-written, though certain technical details could benefit from additional clarity (outlined below). The visual results are striking, as shown on the authors’ website, and I appreciate that the authors provide additional results with limited GPU-hours, making reproduction more feasible for academic labs. Overall, this work is a valuable contribution to view synthesis research.
1. Related Works. While the paper covers key prior work on 3D representation and few-shot view synthesis, it would benefit from a discussion of generative multi-view methods, especially recent works like Free3D (CVPR 2024, which also uses Plücker embedding to encode camera poses) and EscherNet (CVPR 2024, which also supports inference with a varying number of reference/target views). These methods also do not rely on intermediate 3D representations, treating view synthesis as a sequence-to-sequence problem. Ad
### S1 -- Good results on an interesting task
- The task of synthesizing novel views from a set of input views is interesting and very challenging. The proposed method seems to work well on both object-level and scene-level datasets.
- Based on the visual results shown in Figures 3 and 4, the proposed deterministic pipeline can also imagine new content that is invisible from the input views.
### S2 -- Simple ideas and careful implementations
- There are two main transformer-based arch
### W1 --- Significance is not well demonstrated
- The proposed idea is a very specific, minor change to SRT -- basically using a slightly different transformer encoder or decoder to replace the original CNN. Fundamentally, I am not fully convinced that it is even crucial to use only a transformer architecture rather than CNN-based feature extraction followed by the transform.
- This small change seems to lead to a large improvement on both object-level and scene-level datasets. However, if we train t
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
