Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, Zexiang Xu

TL;DR
Long-LRM is a fast, feed-forward 3D Gaussian reconstruction model capable of instant, high-resolution, wide-coverage scene reconstruction from multiple images, achieving comparable quality to optimization methods with significantly higher efficiency.
Contribution
It introduces a long-sequence architecture that mixes Mamba2 and transformer blocks, with token merging and Gaussian pruning, for efficient large-scale 3D scene reconstruction from multiple images.
Findings
Achieves 800x speedup over optimization-based methods.
Handles input sizes at least 60x larger than previous feed-forward models.
Maintains comparable reconstruction quality to optimization approaches.
Abstract
We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. Specifically, it takes in 32 input images at a resolution of 960x540 and produces the Gaussian reconstruction in just 1 second on a single A100 GPU. To handle the long sequence of 250K tokens brought by the large input size, Long-LRM features a mixture of the recent Mamba2 blocks and the classical transformer blocks, enhanced by a light-weight token merging module and Gaussian pruning steps that balance between quality and efficiency. We evaluate Long-LRM on the large-scale DL3DV benchmark and Tanks&Temples, demonstrating reconstruction quality comparable to the optimization-based methods while achieving an 800x speedup w.r.t. the optimization-based approaches and an input size at least 60x larger than the previous feed-forward…
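As a rough check on the 250K-token figure, the sequence length can be estimated from the input size. The sketch below assumes non-overlapping 8x8 image patches, one token per patch, counted by area; the patch size is not stated in this summary and is an assumption, and `token_count` is a hypothetical helper, not part of any released code. The 16x16 case only illustrates what halving the token grid in each dimension would do to the sequence length.

```python
def token_count(num_views: int, width: int, height: int, patch: int) -> int:
    """Estimate the number of patch tokens for a set of input views.

    Assumes non-overlapping patch x patch patches, one token per patch,
    counted by pixel area (540 is not a multiple of 8, so this is an estimate).
    """
    return num_views * width * height // (patch * patch)


views, width, height = 32, 960, 540          # input size from the abstract

base = token_count(views, width, height, patch=8)     # assumed patch size
merged = token_count(views, width, height, patch=16)  # illustrative 2x2 merge

# Full self-attention compares every token pair, so its cost grows with the
# square of the sequence length; a 4x shorter sequence means 16x fewer pairs.
pair_ratio = (base * base) // (merged * merged)

print(base)        # 259200
print(merged)      # 64800
print(pair_ratio)  # 16
```

Under these assumptions the estimate is 259,200 tokens, the same order as the roughly 250K tokens quoted in the abstract.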
Peer Reviews
Decision: Submitted to ICLR 2025
1. Extends feed-forward scene reconstruction methods, e.g., GS-LRM, to more input views. 2. The use of a hybrid network of Mamba and transformer blocks is reasonable for handling extremely long token sequences, though it is not the first paper in this field to introduce Mamba. 3. Practical solutions for memory efficiency through token merging and Gaussian pruning, enabling scaling to high resolutions (960x540) where other variants fail. 4. The ablation study is comprehensive, well demonstrating the eff
1. Lack of novelty. The core contribution of this paper seems to be a combination of GS-LRM with Hamba, Gamba, and MVGamba. 2. The lack of discussion of the above Mamba-based 3D reconstruction models, which have been publicly available for more than half a year, is not acceptable. 3. While this paper presents several practical innovations in memory optimization, it may be more suitable for computer vision conferences than for ICLR.
1. Extends the application of feed-forward 3D scene reconstruction to longer-range inputs. 2. Sound network architecture design, combining transformers and Mamba2 to process long token sequences. 3. Applies a token merging module to reduce the computational overhead of processing long-range input views. 4. The authors provide justification for using Mamba in Table 2, although the comparison with GS-LRM is somewhat unfair.
1. Insufficient justification for using Mamba. GS-LRM claims it can accept an arbitrary number of input views by downsampling images with large patch sizes to shorten the overall token length for global attention. The features after attention can then be upsampled to predict a large number of Gaussians. However, in Table 2, the authors use the same patch size for both the 7M1T and GS-LRM architectures, leading to an unfair comparison. 2. Why does the paper forgo cost volumes and abandon 3D inductive biases? Althou
1. Introduces a method able to infer 3DGS for wide-coverage scenes. 2. Uses the Mamba2 architecture to model relations across long token sequences.
1. The comparison is insufficient: the method is only compared with naive 3DGS. There are recent 3DGS/NeRF variants designed for large-scale scene modeling: Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields, Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering, Mip-Splatting: Alias-free 3D Gaussian Splatting. 2. Despite the inference speed, the videos show that floaters appear without further regularization. 3. The main contribution is to use Mamba2 for long sequence modeling, which
Taxonomy
Topics: Medical Image Segmentation Techniques
Methods: Pruning
