TL;DR
Surf3R is a fast, end-to-end feed-forward method for 3D surface reconstruction from sparse RGB views. It requires no camera calibration or pose estimation and reconstructs an entire scene in under 10 seconds.
Contribution
Surf3R introduces a multi-branch, multi-view decoding architecture and a D-Normal regularizer, enabling rapid, pose-free 3D reconstruction with improved accuracy.
Findings
Achieves state-of-the-art results on ScanNet++ and Replica datasets.
Reconstructs scenes in under 10 seconds without camera pose estimation.
Demonstrates strong generalization and high surface detail accuracy.
Abstract
Current multi-view 3D reconstruction methods rely on accurate camera calibration and pose estimation, requiring complex and time-intensive pre-processing that hinders their practical deployment. To address this challenge, we introduce Surf3R, an end-to-end feedforward approach that reconstructs 3D surfaces from sparse views without estimating camera poses and completes an entire scene in under 10 seconds. Our method employs a multi-branch and multi-view decoding architecture in which multiple reference views jointly guide the reconstruction process. Through the proposed branch-wise processing, cross-view attention, and inter-branch fusion, the model effectively captures complementary geometric cues without requiring camera calibration. Moreover, we introduce a D-Normal regularizer based on an explicit 3D Gaussian representation for surface reconstruction. It couples surface normals with…
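The abstract is cut off before it says what the D-Normal regularizer couples surface normals with, so the following is only a minimal sketch of the general idea behind a depth-normal consistency term, not the authors' formulation. It assumes a per-pixel 3D point map is available (as in DUSt3R-style models), derives normals from it by finite differences, and penalises disagreement with a reference normal map. The function names `normals_from_pointmap` and `normal_consistency_loss` are hypothetical.

```python
import numpy as np


def normals_from_pointmap(points):
    """Estimate unit normals from an (H, W, 3) per-pixel point map.

    Assumption: neighbouring pixels lie on the same local surface, so the
    cross product of the two image-axis tangents approximates the normal.
    Returns an (H-1, W-1, 3) array.
    """
    dx = points[:-1, 1:] - points[:-1, :-1]  # tangent along image x
    dy = points[1:, :-1] - points[:-1, :-1]  # tangent along image y
    n = np.cross(dx, dy)
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(length, 1e-8, None)  # avoid division by zero


def normal_consistency_loss(points, ref_normals):
    """Mean (1 - cosine similarity) between derived and reference normals.

    ref_normals is an (H, W, 3) map of unit normals; it is cropped to match
    the finite-difference output.
    """
    n = normals_from_pointmap(points)
    cos = np.sum(n * ref_normals[:-1, :-1], axis=-1)
    return float(np.mean(1.0 - cos))


# Usage: a fronto-parallel plane at z = 1 has normal (0, 0, 1) everywhere.
H, W = 4, 5
ys, xs = np.mgrid[0:H, 0:W].astype(float)
plane = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
up = np.zeros((H, W, 3))
up[..., 2] = 1.0
loss_aligned = normal_consistency_loss(plane, up)
loss_flipped = normal_consistency_loss(plane, -up)
```

In this toy case the loss is zero when the reference normals match the plane and reaches its maximum of 2 when they point the opposite way. Because gradients of such a term flow back into the geometry that produced the point map, it ties normal supervision to the underlying 3D representation.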
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper addresses a highly relevant problem. Achieving end-to-end 3D reconstruction without camera pose estimation is significant, as traditional pipelines that rely on Structure-from-Motion (SfM) for pose estimation are computationally expensive and often fragile under sparse inputs.
* The motivation is clear, the paper is well written, and the proposed method is easy to follow.
* Experimental results show that this method can achiev…

* While the proposed multi-branch design and cross-reference fusion blocks appear effective for large-scale scene-level reconstruction from sparse views, the overall architectural concept feels relatively straightforward. Similar design principles, though applied in different contexts, have been explored in prior works [1,2].
* The combination of FR and CRF blocks, built on a multi-branch design, appears effective for representing scenes and facilitating feature communication across views. Howeve…

- The paper addresses a challenging and interesting problem: feed-forward 3D surface reconstruction from sparse and unposed RGB images.
- To the best of my knowledge, it is the first work in the line of DUSt3R follow-ups that focuses on direct watertight surface reconstruction.
- The qualitative and quantitative evaluation shows advantages over the chosen baselines.
- The ablation studies regarding use of 3D Gaussians instead of just point maps and the use of the Depth-Normal regularization st…

- The authors blatantly sell ideas of two existing papers as their own contributions, while citing only one of these two papers insufficiently and without clearly stating what is their contribution and what is not:
  - The architecture is 1:1 copied from MV-DUSt3R [1] without any citation of this work in this paper.
  - This paper proposes Feature-Refine (FR) blocks that correspond to the DecBlocks in MV-DUSt3R and the paper also shares almost the same notation (see equation 1 this paper vs equat…

1. The writing is easy to follow.
2. The proposed pipeline is efficient and achieves superior results compared to DUSt3R and NeuralRecon.

Major:
1. The paper claims to propose a feed-forward approach for 3D surface reconstruction without camera calibration as one of its main contributions. However, this idea appears conceptually similar to DUSt3R [1], VGGT [2], and their subsequent works. Moreover, the proposed Depth–Normal Regularization seems to originate from another existing paper [Chen et al. (2024b)]. Therefore, the novelty of this work is unclear. I encourage the authors to clarify how their approach differs fundamentally f…
