Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Lu Sheng

TL;DR
Ouroboros3D introduces a unified recursive diffusion framework that jointly generates multi-view images and reconstructs 3D models, enhancing geometric consistency and overcoming data bias issues of traditional two-stage methods.
Contribution
This work presents the first integrated diffusion-based approach for simultaneous multi-view image generation and 3D reconstruction with self-conditioning.
Findings
Outperforms two-stage and inference-only methods in quality.
Improves geometric consistency in 3D reconstructions.
Demonstrates robustness through joint training and feedback.
Abstract
Existing single image-to-3D creation methods typically involve a two-stage process, first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which integrates diffusion-based multi-view image generation and 3D reconstruction into a recursive diffusion process. In our framework, these two modules are jointly trained through a self-conditioning mechanism, allowing them to adapt to each other's characteristics for robust inference. During the multi-view denoising process, the multi-view diffusion model uses the 3D-aware maps rendered by the reconstruction module at the previous timestep as additional conditions. The recursive diffusion…
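The loop the abstract describes (denoise the views, reconstruct, render 3D-aware maps, feed them back as conditions at the next step) can be sketched as a toy control-flow example. This is not the authors' implementation: every function name here is hypothetical, each "view" is reduced to a single float, the "reconstruction" is just the mean over views, and the noise schedule is an arbitrary linear one. It only illustrates how the feedback rendered at one timestep conditions the next.

```python
import random

NUM_VIEWS = 4   # assumed number of generated views
NUM_STEPS = 10  # assumed number of denoising steps


def denoise(views, feedback, t):
    """Toy stand-in for the multi-view denoiser: predicts a clean value per view.

    When feedback (maps rendered from the previous step's reconstruction) is
    available, it is blended in as an extra condition.
    """
    target = 1.0         # hypothetical clean signal
    strength = 1.0 / t   # denoise more aggressively as t approaches 1
    preds = [v + strength * (target - v) for v in views]
    if feedback is not None:
        preds = [0.5 * p + 0.5 * f for p, f in zip(preds, feedback)]
    return preds


def reconstruct(preds):
    """Toy stand-in for the reconstruction module: fuse all views into one state."""
    return sum(preds) / len(preds)


def render(state, num_views):
    """Toy stand-in for rendering 3D-aware maps: one map per view, all derived
    from the same shared state, so they are consistent across views."""
    return [state] * num_views


def recursive_sample(seed=0):
    rng = random.Random(seed)
    views = [rng.gauss(0.0, 1.0) for _ in range(NUM_VIEWS)]  # start from noise
    feedback = None  # no reconstruction exists before the first step
    for t in range(NUM_STEPS, 0, -1):
        preds = denoise(views, feedback, t)     # conditioned on previous render
        state = reconstruct(preds)              # shared 3D-like state
        feedback = render(state, NUM_VIEWS)     # condition for the next step
        noise_scale = (t - 1) / NUM_STEPS       # no noise added at the last step
        views = [p + noise_scale * rng.gauss(0.0, 0.1) for p in preds]
    return views, state
```

Calling `views, state = recursive_sample()` returns the final per-view values together with the last fused state. In this toy setting the views end up agreeing with one another because every step is pulled toward a render of a single shared state, which is the intuition behind the consistency claim, not a reproduction of it.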
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The general motivation of doing joint training of the diffusion and reconstruction models - it might fill the domain gap between the two separately trained models (such as in LGM). 2. The technical design of the RGB and CCM feedback - it keeps feeding the latest-reconstructed color and geometry info to each step of diffusion generation - thus is able to generate finer and more consistent images for later optimization.
1. Given that the proposed CCM and RGB feedback is claimed as the major source of the improvements, I wonder why, even without CCM and RGB feedback (Table 2), the scores (PSNR, SSIM, LPIPS) still outperform those of the baselines (Table 1). Given that the PSNR “21.761” appears in both tables, I think they use the same test set. Without CCM and RGB, the proposed method degrades to a regular one, so where do the improvements over the baselines come from? Please explain if any other prop
The idea of incorporating feedback from the reconstructed 3D model into the denoising loop is well-founded, as it seems to enhance multi-view consistency and improve overall image quality. The displayed results show quantitative improvements over prior work. In particular, the video demonstrating the recursive refinement process effectively illustrates the model's capability to incrementally enhance the reconstruction quality. The paper is well-written, clearly structured, and easy to follow.
1. The qualitative enhancements over LGM and CLAY are not apparent. A more detailed comparative analysis, including side-by-side visualizations, would help clarify the specific advancements introduced by the proposed method. 2. The reconstruction model's ability to reconstruct a 3D Gaussian Splatting from both slightly and highly noisy images is counterintuitive. An explanation of the model's robustness to varying noise levels would be beneficial. 3. The proposed approach utilizes 3D Gaussian
* The paper proposes a novel framework for 3D generation by introducing 3D feedback through a denoise->reconstruct->render->condition->denoise… loop. The performance looks good. * The idea of incorporating a 3D feedback mechanism into multi-view generation is reasonable. * The paper is well-written and easy to follow.
* The novelty of this paper is limited. The overall pipeline is very similar to existing works such as VideoMV and SyncDreamer. Integrating a large Gaussian reconstruction model into the denoising process of a video diffusion model has been proposed by VideoMV, and the 3D feedback condition has been proposed by SyncDreamer, except that SyncDreamer adopts depth maps while this method adopts RGB and CCM. However, this amounts to very little new contribution for the fast-developing 3D generation research com
1. The paper is very readable. 2. SOTA results with incremental improvements over recent prior work are presented.
Following the emergence of powerful diffusion models in generating or hallucinating highly realistic images 1-2 years ago, many researchers have observed that while the generated images look plausible, they lack 3D consistency and thus we have seen various approaches in recent computer vision and machine learning venues on optimizing image generation and 3D geometric consistency in tandem.
Taxonomy
Topics: Advanced Vision and Imaging · Medical Image Segmentation Techniques · Image Processing Techniques and Applications
Methods: Diffusion
