iLRM: An Iterative Large 3D Reconstruction Model
Gyeongjin Kang, Seungtae Nam, Seungkwon Yang, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park

TL;DR
The paper introduces iLRM, an iterative feed-forward model for 3D reconstruction that addresses the scalability issues of transformer-based methods by decoupling the scene representation, reducing attention complexity, and injecting high-resolution information, and it reports better reconstruction quality and faster reconstruction than existing methods.
Contribution
iLRM presents a novel iterative approach with a two-stage attention scheme and high-resolution injection, enabling scalable and high-quality 3D reconstruction.
Findings
Outperforms existing methods in reconstruction quality.
Achieves faster reconstruction speeds.
Demonstrates effectiveness on RE10K and DL3DV datasets.
Abstract
Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene…
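To make the scalability argument concrete, the following is a minimal sketch that counts query-key pairs for full attention over all image tokens versus a two-stage layout. The stage layout is an assumption for illustration only (a per-view stage between a smaller set of scene tokens and that view's image tokens, followed by a global stage among the scene tokens); the function names, the `scene_tokens_per_view` parameter, and the token counts are hypothetical and are not taken from the paper.

```python
def full_attention_pairs(num_views: int, image_tokens_per_view: int) -> int:
    """Query-key pairs when every image token attends to every other image token."""
    total_tokens = num_views * image_tokens_per_view
    return total_tokens * total_tokens


def two_stage_pairs(num_views: int, image_tokens_per_view: int,
                    scene_tokens_per_view: int) -> int:
    """Query-key pairs for an ASSUMED two-stage layout (illustrative, not the paper's exact design)."""
    # Stage 1 (assumed): within each view, scene tokens attend to that view's image tokens.
    per_view = num_views * scene_tokens_per_view * image_tokens_per_view
    # Stage 2 (assumed): scene tokens from all views attend to each other.
    total_scene_tokens = num_views * scene_tokens_per_view
    cross_view = total_scene_tokens * total_scene_tokens
    return per_view + cross_view


if __name__ == "__main__":
    # Hypothetical sizes: 8 views, 1024 image tokens and 256 scene tokens per view.
    full = full_attention_pairs(8, 1024)
    staged = two_stage_pairs(8, 1024, 256)
    print(f"full attention pairs: {full}")
    print(f"two-stage pairs:      {staged}")
    print(f"ratio:                {full / staged:.2f}")
```

Under these assumed sizes, the quadratic term applies only to the smaller set of scene tokens, while the cost tied to the full-resolution image tokens grows linearly with the number of views.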
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Decoupling representations and staged attention effectively tackle quadratic costs in multi-view processing, enabling more views (e.g., 8 vs. baselines' 2-4) with lower compute and memory. This new way of handling view-camera interaction could help reduce the compute cost of general multi-view transformer models.
* While the concept of iterative refinement is interesting, it is hard to say that the current model design has a strong connection to iterative refinement, especially the claimed “feedback-driven refinement” (L93). Since an LRM usually just stacks attention blocks that process the same series of tokens, one could equally say that the tokens are “iteratively refined” block by block. The paper fails to convincingly show this decoupled representation enables unique iterative refinement
1. The model introduces an efficient two-stage attention mechanism that breaks the quadratic complexity bottleneck of prior methods. This allows it to effectively process a larger number of views and higher-resolution images without prohibitive computational costs. 2. iLRM reframes reconstruction as an iterative refinement process within a feed-forward network and achieves good reconstruction quality on standard benchmarks.
1. This paper only shows 2D novel view synthesis metrics like PSNR, SSIM, which are all about image quality. However, when it comes to reconstruction, the geometry is also very important. CD, F-score and similar metrics should be included. 2. No mesh reconstruction results. Showing conversion to a mesh would have better showcased the coherence of the underlying geometry and its practical applicability for downstream tasks like gaming or simulation. 3. Lack of comparison with feed-forward reconst
1. This work is well-written, and the description of the methodology section is clear. 2. It addresses an important problem, scalability issues in feed-forward 3DGS reconstruction. Compared to previous works, it effectively compresses the number of Gaussians under dense view inputs.
1. The term 'iterative' in the paper's title is difficult to understand. If I understand correctly, it is closer to stacking attention blocks, following the scaling law, as shown in Table 7. Additionally, apart from Figure 1, the paper lacks further qualitative validation of 'iterative refinement.' 2. If I understand correctly, the core of iLRM lies in introducing view embeddings as tokens to be updated for reconstructing 3DGS (from Fig. 2(a) to Fig. 2(b)), thereby improving the efficiency of at
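For readers unfamiliar with the geometry metrics one review asks for (CD and F-score), here is a minimal brute-force sketch on small point clouds. It assumes one common convention (Chamfer distance as the sum of the two mean nearest-neighbor Euclidean distances, and F-score as the harmonic mean of precision and recall at a distance threshold); other conventions, such as squared distances, are also in use, and nothing here is taken from the paper or the reviews. All function names are hypothetical.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbor in `cloud`."""
    return min(math.dist(point, other) for other in cloud)


def chamfer_distance(pred, gt):
    """Sum of mean nearest-neighbor distances in both directions (assumed convention)."""
    pred_to_gt = sum(nearest_distance(p, gt) for p in pred) / len(pred)
    gt_to_pred = sum(nearest_distance(q, pred) for q in gt) / len(gt)
    return pred_to_gt + gt_to_pred


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = sum(nearest_distance(p, gt) <= threshold for p in pred) / len(pred)
    recall = sum(nearest_distance(q, pred) <= threshold for q in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy clouds: the second predicted point is offset by 0.5 along z.
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
    print(chamfer_distance(pred, gt))
    print(f_score(pred, gt, threshold=0.1))
```

Unlike PSNR and SSIM, which compare rendered images, both quantities compare 3D point sets directly, which is why they are sensitive to geometric errors that image metrics can miss.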
Taxonomy
Topics: Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis · Image Processing and 3D Reconstruction
