Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention
Mengfei Li, Xiaoxiao Long, Yixun Liang, Weiyu Li, Yuan Liu, Peng Li, Wenhan Luo, Wenping Wang, Yike Guo

TL;DR
This paper introduces M-LRM, a multi-view 3D reconstruction model that leverages geometry-aware encoding and attention mechanisms to produce high-fidelity 3D shapes more efficiently than previous methods.
Contribution
The paper proposes a multi-view consistent cross-attention scheme and initializes the triplane tokens with 3D priors from the input views, improving reconstruction quality and training speed.
Findings
Achieves higher-fidelity 3D reconstructions.
Demonstrates faster convergence in training.
Outperforms previous multi-view reconstruction methods.
Abstract
Although the Large Reconstruction Model (LRM) has recently demonstrated impressive results, extending its input from a single image to multiple images leads to inefficiencies, subpar geometric and texture quality, and slower-than-expected convergence. This is attributed to the fact that LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to reconstruct high-quality 3D shapes from multiple views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme that enables M-LRM to accurately query information from the input images. Moreover, we employ the 3D priors of the input multi-view images to initialize the triplane tokens. Compared to previous methods, the proposed M-LRM…
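To make the querying step in the abstract concrete, the following is a minimal sketch of plain scaled dot-product cross-attention in which triplane tokens act as queries and image tokens from several views act as keys and values. It is not the paper's geometry-aware scheme: the token counts, the feature dimension, the random projection matrices, and the helper names `softmax` and `cross_attention` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Generic single-head cross-attention (hypothetical helper, not M-LRM's).

    queries: (num_queries, dim) tokens that ask for information
    context: (num_context, dim) tokens that supply information
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (num_queries, num_context)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)

# Assumed toy sizes: 4 input views, 16 image tokens per view, 8 channels.
num_views, tokens_per_view, dim = 4, 16, 8
# Assumed toy triplane: 3 planes of 4 x 4 tokens each.
num_triplane_tokens = 3 * 4 * 4

image_tokens = rng.normal(size=(num_views, tokens_per_view, dim))
# Flatten all views into one context sequence so every query sees every view.
context = image_tokens.reshape(num_views * tokens_per_view, dim)
triplane_tokens = rng.normal(size=(num_triplane_tokens, dim))

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

updated, weights = cross_attention(triplane_tokens, context, w_q, w_k, w_v)
print(updated.shape, weights.shape)
```

In this naive form every triplane token attends to every image token from every view with no geometric constraint, which is roughly the kind of images-to-3D translation the abstract criticizes. A geometry-aware variant would presumably restrict or bias the attention using camera information, but the truncated abstract does not specify how.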
Taxonomy
Topics: Medical Image Segmentation Techniques · Medical Imaging Techniques and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
