LRM: Large Reconstruction Model for Single Image to 3D
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan

TL;DR
LRM is a scalable transformer-based model that predicts a detailed 3D reconstruction of an object from a single image within about 5 seconds. It is trained on around 1 million objects so that it generalizes to real-world captures and generated images.
Contribution
It introduces the first large-scale transformer model for single-image 3D reconstruction trained on a million objects, enabling fast and high-quality results.
Findings
Predicts 3D models within 5 seconds from a single image.
Generalizes well to real-world and generative model images.
Trained on 1 million objects from diverse datasets.
Abstract
We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs, including real-world in-the-wild captures and images created by generative…
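The abstract says LRM predicts a neural radiance field (NeRF) but does not describe here how pixels are produced from it. As background only, the following is a minimal sketch of the standard NeRF volume-rendering quadrature, which composites density and color samples along one camera ray. It is not taken from the LRM code; the function name `composite_ray` and the sample values are assumptions made for illustration.

```python
import math


def composite_ray(densities, colors, deltas):
    """Composite samples along one ray with the standard NeRF quadrature.

    densities: volume density (sigma) at each sample, front to back
    colors:    RGB triple at each sample
    deltas:    distance between consecutive samples
    Returns (pixel_rgb, accumulated_opacity).
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity contributed by this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        # Light surviving past this sample.
        transmittance *= 1.0 - alpha
    return pixel, 1.0 - transmittance


# Hypothetical ray: empty space, then an opaque green surface, then a hidden sample.
rgb, opacity = composite_ray(
    densities=[0.0, 1e9, 5.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

In this example the opaque second sample absorbs the ray, so the blue sample behind it contributes nothing to the pixel.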
Peer Reviews
Decision · ICLR 2024 oral
* To my knowledge, this paper is the first work showing the scaling ability of transformers on novel view synthesis.
* The task of single-view novel view synthesis is extremely challenging and well-motivated.
* The method is extremely efficient during inference as it only requires a single forward pass, unlike many generative models, e.g. score-based generative models and/or optimization-based methods.
* As a non-generative model, it is astounding for me to see its amazing performances. It seems…
* The method requires significant computational resources.
* A minor issue, but the paper shows no quantitative comparison with any prior work.
* Since the method is discriminative, it is not able to sample different realizations of an input. Additionally, the averaging of modes, even though it has been weakened a lot compared with prior works such as PixelNeRF, still exists, as the authors mentioned.
* The task of novel view synthesis is inherently probabilistic, as the authors mentioned. Even though…
The paper presents an interesting system that hallucinates/synthesizes 3D appearance from DINO features. Compared to recent methods, LRM has the following strengths:
- It directly produces 3D representations in a single forward pass, instead of running optimization to construct a 3D model for each input instance.
- It retains details from the input view better, possibly due to the use of image-based features.
- It does not require canonicalized training objects, making it easier to apply LRM to other datasets.
Although the results look promising, the paper has two main weaknesses:
- Insufficient quantitative comparisons: the paper does not conduct any quantitative evaluation against other methods. I believe novel view synthesis and 3D reconstruction can be evaluated on the held-out sets of those 3D object datasets, and a user study should also be possible. Even if the quantitative results may not reflect the generation quality entirely, the paper should include discussions on why these scores are n…
This paper is well-written and the results are promising. Although earlier papers tried to train a generalizable NeRF predictor, this paper proves that generalizable NeRF prediction can be trained on a large-scale dataset. To my knowledge, this is the first attempt to train it at a scale like Objaverse + MVImgNet. The experiment section is well-organized and sufficient, and provides a thorough ablation of the different model components.
I don't think there is any apparent weakness in the paper. Please refer to the questions section for my other questions regarding the details of the paper.
Code & Models
- 🤗 stabilityai/TripoSR · model · 92k dl · ♡ 604
- 🤗 camenduru/OpenLRM · model
- 🤗 zxhezexin/openlrm-small-obj-1.0 · model · 20 dl · ♡ 7
- 🤗 zxhezexin/openlrm-large-obj-1.0 · model · 15 dl · ♡ 6
- 🤗 zxhezexin/openlrm-base-obj-1.0 · model · 20 dl · ♡ 13
- 🤗 zxhezexin/openlrm-obj-small-1.1 · model · 52 dl · ♡ 1
- 🤗 zxhezexin/openlrm-obj-base-1.1 · model · 38 dl · ♡ 2
- 🤗 zxhezexin/openlrm-obj-large-1.1 · model · 14 dl · ♡ 1
- 🤗 zxhezexin/openlrm-mix-large-1.1 · model · 259 dl · ♡ 6
- 🤗 zxhezexin/openlrm-mix-base-1.1 · model · 470 dl · ♡ 6
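The repositories listed above are hosted on the Hugging Face Hub. As a rough sketch, their files could be fetched with `snapshot_download` from the `huggingface_hub` package. The names `MODEL_REPOS`, `hub_url`, and `fetch_checkpoint` are hypothetical helpers introduced for this example, and the sketch only downloads files; how each repository's weights are loaded and run is not covered here and depends on that repository's own code.

```python
# Repository ids copied from the list above.
MODEL_REPOS = [
    "stabilityai/TripoSR",
    "camenduru/OpenLRM",
    "zxhezexin/openlrm-small-obj-1.0",
    "zxhezexin/openlrm-large-obj-1.0",
    "zxhezexin/openlrm-base-obj-1.0",
    "zxhezexin/openlrm-obj-small-1.1",
    "zxhezexin/openlrm-obj-base-1.1",
    "zxhezexin/openlrm-obj-large-1.1",
    "zxhezexin/openlrm-mix-large-1.1",
    "zxhezexin/openlrm-mix-base-1.1",
]


def hub_url(repo_id):
    """Return the Hub web page for a model repository id."""
    return f"https://huggingface.co/{repo_id}"


def fetch_checkpoint(repo_id):
    """Download all files of a model repository and return the local directory.

    Assumes the optional `huggingface_hub` package is installed and that
    network access is available; imported lazily so the rest of this
    sketch works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

For example, `fetch_checkpoint("zxhezexin/openlrm-mix-base-1.1")` would download that repository into the local Hub cache.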
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
