Denoising Diffusion via Image-Based Rendering
Titas Anciukevičius, Fabian Manhardt, Federico Tombari, Paul Henderson

TL;DR
This paper introduces a novel diffusion-based method for detailed 3D scene reconstruction and generation from 2D images, utilizing a new neural scene representation and a unified framework that outperforms existing approaches.
Contribution
The work presents IB-planes for efficient large-scale 3D scene representation, a diffusion model trained on 2D images without extra supervision, and a strategy to prevent trivial solutions in 3D reconstruction.
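One of the reviews quoted further down describes the strategy against trivial solutions as dropping out the features from a view when rendering to that same viewpoint. The following is a minimal sketch of that idea, not the authors' implementation. The function name `aggregate_with_view_dropout`, the array layout, and the use of mean pooling over the remaining views are all assumptions made for illustration.

```python
import numpy as np


def aggregate_with_view_dropout(view_features, target_view):
    """Pool per-view features while ignoring the view being rendered.

    view_features: array of shape (num_views, num_points, channels) holding
        the features each input view contributes at each sampled 3D point
        (hypothetical layout).
    target_view: index of the view currently being rendered.
    Mean pooling is an illustrative assumption, not the paper's aggregator.
    """
    view_features = np.asarray(view_features, dtype=np.float64)
    num_views = view_features.shape[0]
    if num_views < 2:
        raise ValueError("need at least two views to drop one")
    if not 0 <= target_view < num_views:
        raise IndexError("target_view out of range")
    keep = np.ones(num_views, dtype=bool)
    keep[target_view] = False  # drop the target view's own features
    # Average the features contributed by the remaining views.
    return view_features[keep].mean(axis=0)
```

Because the rendered view cannot read its own features, the pooled result has to be explained by the other views, which is the intuition behind avoiding a solution that merely copies each input image.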
Findings
Superior results in 3D generation and view synthesis
Effective reconstruction of real-world and synthetic scenes
Unification of 3D reconstruction and generation in a single model
Abstract
Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes,…
Peer Reviews
Decision: ICLR 2024 poster
1. This method achieves state-of-the-art performance in both 3D reconstruction and unconditional generation tasks. Notably, in Figure 2, impressive scene-level 3D generation results are observed, surpassing the capabilities of prior works. 2. The designed method is practical as it can accommodate arbitrary numbers of input views, making it versatile and applicable in various scenarios. 3. The methodology section is characterized by its clarity and accessibility, facilitating a comprehensive unde
1. No qualitative comparison was made with baselines, particularly VSD. While this paper includes numerous quantitative comparisons with three baselines, the figures only display this paper's results on the reconstruction and generation tasks. The visual quality of VSD is also good, and it would be valuable to include its results for comparison. 2. No quantitative or qualitative comparison was made with VSD on the task of single-view reconstruction. 3. The experiments section lacks clarity, as t
This paper presents a diffusion-based model for novel-view generation given several input views. The model is able to generate new content in areas that the given viewpoints do not cover.
I have two main concerns about this paper: 1. After reviewing the supplementary material, I noticed that the generated images from different viewpoints don't seem very consistent. The videos display noticeable bouncing. Have the authors conducted both qualitative and quantitative assessments to verify if the proposed representation approach truly maintains consistency among the outputs of the diffusion model for the same 3D scene? 2. The paper presentation could be enhanced. While Figure 1
+ In the usual multi-view 3D reconstruction setting, in which the camera views are not structured (i.e., allowing free movement), this paper may be the first attempt to use modern image generation methods like diffusion models. + The proposed method outperforms SOTA diffusion-based 3D reconstruction. + The simple regularization used in this work, dropping out the features from a view when rendering to that same viewpoint, would be helpful in a broader context.
- There are some unclear technical details and contributions (see detailed comments). - As a single-view 3D reconstruction method, I agree with the practical value of using diffusion models or similar stuff. However, the proposed method basically targets the multi-view contexts. For multi-view 3D reconstruction, usual MVS-based methods (or neural-MVS-based methods, perhaps) may still achieve much better performance for large-scale and detailed reconstruction. A fundamental drawback of using mu
Taxonomy
Topics: Computer Graphics and Visualization Techniques
Methods: Diffusion
