Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention
Xiaosong Jia, Yihang Sun, Junqi You, Songbur Wong, Zichen Zou, Junchi Yan, Zuxuan Wu, and Yu-Gang Jiang

TL;DR
Efficient-LVSM introduces a dual-stream transformer architecture for large view synthesis that reduces computational complexity, improves speed, and enhances performance over previous models like LVSM.
Contribution
It proposes a decoupled co-refinement attention mechanism that improves efficiency and generalization in large view synthesis models.
Findings
Achieves 29.86 dB PSNR on RealEstate10K with 2 views, surpassing LVSM.
Offers 2x faster training convergence and 4.4x faster inference.
State-of-the-art performance on multiple benchmarks with strong zero-shot generalization.
Abstract
Feedforward models for novel view synthesis (NVS) have recently been advanced by transformer-based methods like LVSM, which use attention among all input and target views. In this work, we argue that this full self-attention design is suboptimal, suffering from quadratic complexity with respect to the number of input views and rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed. Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong…
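The attention pattern described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head attention with no learned projections, no MLP, and no normalization, and it assumes that target tokens cross-attend to the input tokens refined in the same layer. The names `attention`, `residual`, `dual_stream_layer`, and `score_counts` are hypothetical. The last helper only counts attention-score entries per layer to show why decoupling avoids the quadratic growth in the number of input views; it does not model the reported speedups.

```python
import math
import random

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors (no learned weights)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(dim)])
    return out

def residual(x, update):
    """Element-wise residual connection: x + update."""
    return [[a + b for a, b in zip(row, upd)] for row, upd in zip(x, update)]

def dual_stream_layer(input_views, target_tokens):
    """One illustrative layer: inputs and targets are refined in separate streams."""
    # Input stream: each view attends only to its own tokens (intra-view self-attention).
    refined_inputs = [residual(v, attention(v, v, v)) for v in input_views]
    # Target stream, step 1: self-attention among target tokens.
    t = residual(target_tokens, attention(target_tokens, target_tokens, target_tokens))
    # Target stream, step 2: cross-attention to all input tokens of this layer
    # (assumption: the refined inputs of the same layer serve as keys and values).
    memory = [tok for view in refined_inputs for tok in view]
    t = residual(t, attention(t, memory, memory))
    return refined_inputs, t

def score_counts(num_views, tokens_per_view, num_target_tokens):
    """Attention-score entries per layer: full self-attention vs. the decoupled pattern."""
    full = (num_views * tokens_per_view + num_target_tokens) ** 2
    decoupled = (num_views * tokens_per_view ** 2              # intra-view self-attention
                 + num_target_tokens ** 2                      # target self-attention
                 + num_target_tokens * num_views * tokens_per_view)  # cross-attention
    return full, decoupled

random.seed(0)
views = [[[random.gauss(0, 1) for _ in range(8)] for _ in range(4)] for _ in range(2)]
target = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
for _ in range(2):  # stack two layers; each target layer reads that layer's inputs
    views, target = dual_stream_layer(views, target)
print(score_counts(2, 4, 4))  # (144, 80)
```

In this sketch the input stream never reads the target tokens, so its per-layer features could be computed once and reused for any number of target views. The decoupled count also grows linearly with the number of input views, while the full self-attention count grows quadratically.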
Peer Reviews
Decision: ICLR 2026 Poster
S1. Decomposing the full self-attention into different modules with cross-attention makes sense, as has been explored in various domains to design more efficient neural network architectures.
S2. The experiments show significant improvements in novel view synthesis, while also enhancing inference efficiency.
S3. The modified architectures can incorporate REPA to effectively train the hidden representations of input views.
W1. The technical contribution is limited. Instead of using full self-attention across input and target views, employing cross-attention is a typical design choice for improving training and inference efficiency [NewRef-1].
W2. Despite the performance improvements, Efficient-LVSM cannot address the fundamental limitations of LVSM. For example, its architecture cannot account for the alignments either between generated target views or within the input views. Therefore, the overall impact of this
- The paper provides a systematic analysis of LVSM’s inefficiencies and derives a principled redesign via a decoupled encoder-decoder. The KV-cache design enabling incremental inference is a noteworthy contribution for real-time or interactive view synthesis, rarely explored in prior feedforward NVS models.
- Efficient-LVSM achieves state-of-the-art reconstruction quality on both scene-level (RealEstate10K) and object-level (GSO/ABO) benchmarks. The reported 0.9 dB PSNR gain, 4× inference speed
- The dual-stream co-refinement design is architecturally very similar in spirit to the MM-DiT block introduced by Stable Diffusion 3 (2024). The authors are encouraged to cite MM-DiT and clarify how Efficient-LVSM extends this pattern to the feedforward NVS setting.
1. Clearly justified and very reasonable architectural change. The full attention used by the original LVSM restricts its efficiency in a lot of use cases. And when comparing with the encoder-decoder version of LVSM, the authors identified the major problem: we need to use the key-value cache of all layers!
2. Strong empirical results. The experimental results on Objaverse and RealEstate10K are quite strong, with much better rendering quality and less training time.
3. The author shows a very interesting stud
I do have a few comments. I think the authors should list more training details for their method and their ablation experiments, such as the batch size and the total number of training iterations. I think these are missing. For training batch sizes, there are some small but important details: the original LVSM needs to repeat the batch to make sure that each pass of the model contains only one target view, and this is one of the core reasons that the original decoder-only LVSM is expensive in training and infer
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Data Compression Techniques · Video Coding and Compression Technologies
