ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan

TL;DR
ReconX leverages large pre-trained video diffusion models to reconstruct 3D scenes from sparse views by synthesizing consistent video frames guided by a 3D structure condition, improving quality and generalizability.
Contribution
It introduces a novel approach that uses video diffusion models and 3D structure encoding to enhance sparse-view 3D scene reconstruction.
Findings
Outperforms state-of-the-art methods in quality.
Ensures high 3D scene consistency.
Effective with limited input views.
Abstract
Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a…
Peer Reviews
Decision · Submitted to ICLR 2025
- The paper tackles an important and very challenging task of reconstructing general scenes from very sparse views.
- The two-stage approach leverages the strong prior of video diffusion models and the technical contributions fit well into this framework.
- The experimental evaluation shows strong results consistently outperforming state-of-the-art baselines on both
  - in training distribution data
  - significant quantitative and qualitative advantage for small angle variance in input vie
- The lack of precise mathematical formulation raises doubt about proposition 1 and its proof:
  - The main inequality in equation 17 (appendix) is justified only verbally and is not obvious to me.
  - A counterexample for the inequality in line 841 could be a dataset mainly consisting of dark rooms with all ground truth renderings being black except for one where the lights are on (with possibly very complex geometry or even transparent and reflective materials). For this case, fitting the marg
1. The proposed 3D point cloud conditioning for ensuring the 3D consistency of generated frames is novel and intuitive.
2. The video diffusion architecture is well-ablated, and each of the highlighted contributions impacts the final reconstruction quality positively.
3. Extensive experiments demonstrate ReconX's ability to achieve high-quality reconstructions that outperform related state-of-the-art methods, particularly in challenging scenarios where there is a large angle variance between inpu
1. Weak Benchmarks: ReconX shows strong sparse-view reconstruction capabilities for scenes from LLFF and DTU datasets, but these are relatively simpler benchmarks, and since the method uses strong video diffusion priors, I would expect a comparison on more challenging benchmarks like MipNeRF360 (or Tanks and Temples). CAT3D / ReconFusion already provides data splits for 3, 6, and 9 view settings for this dataset, so comparisons with a generalized N-view setting would strengthen the submission fu
There does not appear to be previous work combining ideas from DUSt3R, video diffusion and 3DGS, so this work is novel in that sense, although there are concurrent ones (ViewCrafter [1], LM-Gaussian [2], 3DGS-Enhancer [3], MVSplat360 [4]) that explore similar ideas. The proposed methodology appears to be sound and mostly well-justified, although this perhaps explains the convergent ideas highlighted by the concurrent works. Most of the paper is well-written and diagrams are clear, excep
**Problems with experimental settings** The primary weakness of this paper lies in the experimental section. The baseline methods chosen seem to be the wrong ones to compare against, as they are all _feed-forward_ methods that produce 3DGS representations (pixelSplat, MVSplat) or radiance fields (MuRF, pixelNeRF, GPNR), done _in a single pass_. In contrast, the proposed method is based on per-scene 3DGS optimization, which unsurprisingly will perform better. Furthermore, the choice of datasets
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Computer Graphics and Visualization Techniques
Methods: Diffusion
