ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Fangfu Liu; Wenqiang Sun; Hanyang Wang; Yikai Wang; Haowen Sun; Junliang Ye; Jun Zhang; Yueqi Duan

arXiv:2408.16767·cs.CV·June 26, 2025·2 cites

TL;DR

ReconX leverages large pre-trained video diffusion models to reconstruct 3D scenes from sparse views by synthesizing consistent video frames guided by a 3D structure condition, improving quality and generalizability.

Contribution

It introduces a novel approach that uses video diffusion models and 3D structure encoding to enhance sparse-view 3D scene reconstruction.

Findings

1. Outperforms state-of-the-art methods in quality.
2. Ensures high 3D scene consistency.
3. Effective with limited input views.

Abstract

Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 5

Strengths

- The paper tackles an important and very challenging task of reconstructing general scenes from very sparse views.
- The two-stage approach leverages the strong prior of video diffusion models and the technical contributions fit well into this framework.
- The experimental evaluation shows strong results consistently outperforming state-of-the-art baselines on both
  - in training distribution data
  - significant quantitative and qualitative advantage for small angle variance in input vie

Weaknesses

- The lack of precise mathematical formulation raises doubt about proposition 1 and its proof:
  - The main inequality in equation 17 (appendix) is justified only verbally and is not obvious to me.
  - A counterexample for the inequality in line 841 could be a dataset mainly consisting of dark rooms with all ground truth renderings being black except for one where the lights are on (with possibly very complex geometry or even transparent and reflective materials). For this case, fitting the marg

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. The proposed 3D point cloud conditioning for ensuring the 3D consistency of generated frames is novel and intuitive.
2. The video diffusion architecture is well-ablated, and each of the highlighted contributions impacts the final reconstruction quality positively.
3. Extensive experiments demonstrate ReconX's ability to achieve high-quality reconstructions that outperform related state-of-the-art methods, particularly in challenging scenarios where there is a large angle variance between inpu

Weaknesses

1. Weak Benchmarks: ReconX shows strong sparse-view reconstruction capabilities for scenes from LLFF and DTU datasets, but these are relatively simpler benchmarks, and since the method uses strong video diffusion priors, I would expect a comparison on more challenging benchmarks like MipNeRF360 (or Tanks and Temples). CAT3D / ReconFusion already provides data splits for 3, 6, and 9 view settings for this dataset, so comparisons with a generalized N-view setting would strengthen the submission fu

Reviewer 03 · Rating 5 · Confidence 4

Strengths

There does not appear to be previous work combining ideas from Dust3r, video diffusion and 3DGS, so this work is novel in that sense, although there are concurrent ones (ViewCrafter [1], LM-Gaussian [2], 3DGS-Enhancer [3], MVSplat360 [4]) that explore similar ideas. The proposed methodology appears to be sound and mostly well-justified, although this perhaps explains the convergent ideas highlighted by the concurrent works. The paper is mostly well-written and the diagrams are clear, excep

Weaknesses

**Problems with experimental settings**

The primary weakness of this paper lies in the experimental section. The baseline methods chosen seem to be the wrong ones to compare against, as they are all _feed-forward_ methods that produce 3DGS representations (pixelSplat, MVSplat) or radiance fields (MuRF, pixelNeRF, GPNR), done _in a single pass_. In contrast, the proposed method is based on per-scene 3DGS optimization, which unsurprisingly will perform better. Furthermore, the choice of datasets

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Computer Graphics and Visualization Techniques

Methods: Diffusion