TL;DR
ReconViaGen introduces a framework that combines reconstruction priors with diffusion-based 3D generative models to improve the accuracy and consistency of multi-view 3D object reconstruction, particularly where occlusions and sparse view coverage leave pure reconstruction methods with incomplete results.
Contribution
The paper proposes ReconViaGen, a new method that effectively integrates reconstruction priors into diffusion-based 3D generation, enhancing multi-view reconstruction accuracy and consistency.
Findings
Achieves more complete and accurate 3D reconstructions.
Improves consistency between generated details and input views.
Demonstrates superior performance over existing methods.
Abstract
Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstruction incompleteness. Recent advancements in diffusion-based 3D generative techniques offer the potential to address these limitations by leveraging learned generative priors to hallucinate invisible parts of objects, thereby generating plausible 3D structures. However, the stochastic nature of the inference process limits the accuracy and reliability of generation results, preventing existing reconstruction frameworks from integrating such 3D generative priors. In this work, we comprehensively analyze the reasons why diffusion-based 3D generative methods fail to achieve high consistency, including (a) the insufficiency in constructing and leveraging cross-view connections when extracting multi-view…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper provides clear and detailed descriptions of the methodology and experimental design, making it easy to understand and replicate.
- The proposed ReconViaGen framework effectively addresses the limitations of pure reconstruction and pure generative methods by combining their strengths. The coarse-to-fine strategy, utilizing Global Geometric Conditions (GGC) for coarse structure and Per-View Conditions (PVC) for fine details, allows for a more accurate and consistent 3D reconstruction
- The requirements for multi-view inputs may limit the applicability of the method in scenarios where only single-view inputs are available. In practical applications, users often can only provide a single view. It is unclear how the method would perform with single-view inputs, and the paper does not discuss or analyze this aspect.
- In Section 4.3, the ablation study does not provide visual comparisons to illustrate the effects of different modules, relying solely on numerical results to demon
1. Injects reconstruction priors (VGGT) into a 3D diffusion generator, addressing a well-known failure mode of existing generative methods: global structure drift and local inconsistency.
2. The clear division of labor and strong motivation: GGC for structural cues and PVC for fine details match information granularity.
3. RVC corrects the diffusion velocity field with differentiable rendering feedback at inference, simple to implement yet clearly improves input consistency.
4. Broad evaluation wit
1. While results support “GGC is more suitable for SS and PVC is more suitable for SLAT,” deeper interpretability is missing.
2. RVC hyperparameter/stability analysis is light: sensitivity to α, timestep threshold (t < 0.5), and the outlier rejection threshold (0.8) is not quantified; inference-time overhead (decoding/rendering per step) is not reported.
3. Pose refinement pipeline is engineering-heavy.
4. RVC is inference-only, creating a potential train–inference gap.
5. Some theoretical clari
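The review points above describe RVC as an inference-time correction of the diffusion velocity field driven by rendering feedback, controlled by a step size α, a timestep gate (t < 0.5), and an outlier rejection threshold (0.8). A minimal sketch of that idea on a toy problem may help make the hyperparameters concrete. Everything below is an assumption made for illustration, not the paper's implementation: the latent is a plain list of floats, the "renderer" just hides one coordinate per view, the velocity model is a constant rectified-flow field, the correction is a plain gradient step on the predicted clean sample, and the 0.8 threshold is applied to the per-view loss (the reviews do not say which quantity it applies to).

```python
def render(latent, view):
    # Toy stand-in for a differentiable renderer: view `view` hides one coordinate.
    return [c for i, c in enumerate(latent) if i != view]

def view_loss_and_grad(latent, view, target):
    # Mean squared error against an observed view, plus its analytic gradient.
    pred = render(latent, view)
    n = len(pred)
    loss = sum((p - q) ** 2 for p, q in zip(pred, target)) / n
    grad = [0.0] * len(latent)
    visible = [i for i in range(len(latent)) if i != view]
    for i, p, q in zip(visible, pred, target):
        grad[i] = 2.0 * (p - q) / n
    return loss, grad

def make_toy_velocity(noise, biased_x0):
    # Hypothetical "generator": a straight-line flow from noise to a biased shape.
    v = [e - b for e, b in zip(noise, biased_x0)]
    return lambda x, t: v

def sample(velocity_fn, x, views, steps=50, alpha=0.0, t_gate=0.5, reject_above=0.8):
    # Euler sampling from t=1 (noise) to t=0 (sample), with x_t = (1-t)*x0 + t*noise.
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity_fn(x, t)
        if alpha > 0.0 and t < t_gate:
            # Predicted clean sample implied by the current velocity.
            x0_hat = [xi - t * vi for xi, vi in zip(x, v)]
            total, used = [0.0] * len(x), 0
            for view, target in views:
                loss, grad = view_loss_and_grad(x0_hat, view, target)
                if loss > reject_above:
                    continue  # assumed outlier rule: skip views that disagree too much
                total = [a + g for a, g in zip(total, grad)]
                used += 1
            if used:
                # Raising v along the loss gradient lowers x0_hat's rendering loss.
                v = [vi + alpha * g / used for vi, g in zip(v, total)]
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

truth = [1.0, -0.5, 0.25]
noise = [0.7, -1.2, 0.4]
views = [(v, render(truth, v)) for v in (0, 1)]
velocity = make_toy_velocity(noise, [c + 0.3 for c in truth])
uncorrected = sample(velocity, noise, views, alpha=0.0)
corrected = sample(velocity, noise, views, alpha=5.0)
```

In this toy setting each gated step amounts to one gradient-descent step of size `alpha * dt` on the predicted clean sample, which is why the sensitivity to α and to the gate raised in point 2 matters: a larger α or an earlier gate moves the sample further from what the generator alone would produce. The per-step rendering inside the loop is also where the unreported inference overhead would come from.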
1. The authors combine two very strong recent works, VGGT and TRELLIS, with a clean Cross-Attention module into an E2E pipeline, which yields decent improvements in the results.
2. The paper is presented in a clear and lucid manner.
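For readers unfamiliar with the mechanism named in point 1, cross-attention lets one set of tokens (here, hypothetically, the generator's latent tokens) read from another set (here, hypothetically, features from the reconstruction model). The following is a generic single-head sketch in plain Python, assuming no learned projection matrices and no batching; it illustrates the standard scaled dot-product form, not the specific module used in the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # queries: tokens being updated (assumed: generator latents).
    # keys/values: conditioning tokens (assumed: reconstruction features).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out
```

Each output token is a weighted average of the condition values, with weights set by how well the query matches each key. A real module would add learned query, key, and value projections, multiple heads, and a residual connection.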
The paper, however, raises the following concerns:
**Major Concerns**:
1. The RVC step contributes most of the visual detail in the generated results, yet this step lacks sufficient detail and discussion in the paper. Most importantly, how does an RVC-only baseline fare against the `(d)-full` version of the method? For the rendering-aware correction, something as simple as the predicted point clouds of VGGT or even sparse reconstructions from COLMAP are meaningful because in this v
