TL;DR
G-CUT3R is a new method that improves 3D scene reconstruction by integrating camera and depth priors into a feed-forward model, enhancing accuracy while maintaining flexibility across different input types.
Contribution
It introduces a lightweight modification to CUT3R that incorporates multiple prior data sources through dedicated encoders and feature fusion, enabling improved 3D reconstruction performance.
Findings
Significant performance improvements on multiple benchmarks.
Effective utilization of auxiliary depth and camera information.
Flexible integration of various prior modalities during inference.
Abstract
We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility…
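The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract only states that per-modality features are fused with RGB image tokens via zero convolution. Everything else here is an assumption for illustration: the additive fusion, the per-token linear map standing in for a 1x1 convolution, the feature dimensions, and the names `ZeroInitProjection` and `fuse_priors`. It is not the authors' implementation.

```python
import numpy as np


class ZeroInitProjection:
    """Per-token linear map with weights and bias initialised to zero.

    Hypothetical stand-in for a 1x1 "zero convolution": at initialisation
    it outputs zeros, so the fused tokens equal the RGB tokens.
    """

    def __init__(self, in_dim, out_dim):
        self.weight = np.zeros((in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, features):
        return features @ self.weight + self.bias


def fuse_priors(rgb_tokens, prior_features, projections):
    """Add the projected features of every available prior to the RGB tokens.

    `prior_features` may hold any subset of modalities, including none
    (assumed additive fusion, not confirmed by the source).
    """
    fused = rgb_tokens.copy()
    for name, features in prior_features.items():
        fused = fused + projections[name](features)
    return fused


rng = np.random.default_rng(0)
num_tokens, rgb_dim = 16, 32  # illustrative sizes only
rgb_tokens = rng.normal(size=(num_tokens, rgb_dim))
depth_feats = rng.normal(size=(num_tokens, 24))  # output of a depth encoder (assumed)
pose_feats = rng.normal(size=(num_tokens, 8))    # output of a pose encoder (assumed)

projections = {
    "depth": ZeroInitProjection(24, rgb_dim),
    "pose": ZeroInitProjection(8, rgb_dim),
}

# At initialisation the priors contribute nothing to the fused tokens.
fused_init = fuse_priors(rgb_tokens, {"depth": depth_feats}, projections)

# Stand-in for training: give the depth projection non-zero weights.
projections["depth"].weight = rng.normal(scale=0.1, size=(24, rgb_dim))
fused_trained = fuse_priors(
    rgb_tokens, {"depth": depth_feats, "pose": pose_feats}, projections
)
```

Under these assumptions, starting from zero means the modified model initially behaves like the image-only model, and a missing modality is handled by leaving it out of `prior_features`.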
Peer Reviews
Decision·Submitted to ICLR 2026
This work incorporates additional priors into feed-forward 3D reconstruction, enhancing its flexibility for diverse application scenarios. Comprehensive experiments demonstrate significant performance gains over state-of-the-art methods.
- The paper claims to be efficient and lightweight. However, no additional results are provided to support this, such as FLOPs or parameter counts compared with CUT3R and Pow3R. Since the method introduces extra encoders and layers for additional modalities, it may incur substantial parameter and computation overhead relative to CUT3R, potentially compromising efficiency. It is unclear whether the reported efficiency stems primarily from inheriting CUT3R’s efficiency rather than being more effic
- Technically sound and well-motivated: both CUT and G-Reg directly target known weaknesses of feed-forward 3D reconstruction (cross-view inconsistency and surface noise) and are implemented cleanly.
- Improved stability and quality: the proposed regularizations yield consistent gains in quantitative metrics and visual quality across diverse datasets.
- Good empirical rigor: ablations on each component demonstrate that uncertainty alignment and geometric regularization complement each other.
- Limited conceptual novelty: both CUT and G-Reg are straightforward extensions of well-known principles (uncertainty calibration and geometric smoothing). The contributions lie more in empirical engineering than in new theoretical or algorithmic insight.
- Lack of deeper analysis: the paper does not explore why these regularizations help beyond intuitive reasoning; no theoretical justification or failure analysis is offered.
- Possible over-smoothing: G-Reg may suppress fine details, but no perceptua
- Clear problem framing: Many feed-forward methods ignore readily available priors. Incorporating them is practically important. However, I suggest mentioning DepthSplat [a] as it is also a feed-forward method using depth priors (for a 3DGS reconstruction).
- The method is lightweight and builds upon a well-established baseline, CUT3R.
- Unified model for arbitrary prior subsets: training with random modality subsets reflects practical scenarios.
- Strong experiments.
[a] Xu, H., Peng, S., Wa
- It is a bit unclear to me what depth is used for the datasets. Did the authors always use sensor depth? Using anything else would render the results incorrect. I put this in the weaknesses given that this is very important. The paper should specify a detailed, dataset-by-dataset description of depth sources used as priors to make the results understandable.
- Same question holds for the camera poses. Do the authors use SLAM-estimated poses (without post-processing) as priors or the GT ones? U
