3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance
Taewon Kang, Divya Kothandaraman, Dinesh Manocha, Ming C. Lin

TL;DR
This paper presents a novel method that combines 3D-free and 3D-based approaches to generate high-quality, camera-controlled novel views from a single image, effectively handling complex scenes without extensive training data.
Contribution
It introduces an approach that leverages pretrained diffusion models and enriches CLIP with 3D camera information for versatile, high-fidelity view synthesis from a single image.
Findings
Outperforms existing models in qualitative and quantitative evaluations.
Achieves high-fidelity, consistent novel views at specified camera angles.
Handles complex scenes without extensive training or additional 3D data.
Abstract
Recent 3D novel view synthesis (NVS) methods often require extensive 3D data for training, and also typically lack generalization beyond the training distribution. Moreover, they tend to be object centric and struggle with complex and intricate scenes. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without the need for a large amount of 3D-based training data, but lack camera control. In this paper, we introduce a method capable of generating camera-controlled viewpoints from a single input image, by combining the benefits of 3D-free and 3D-based approaches. Our method excels in handling complex and diverse scenes without extensive training or additional 3D and multiview data. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view…
Peer Reviews
Decision: Submitted to ICLR 2025
- The proposed approach effectively integrates 3D-free methods with pretrained 3D-based priors to achieve viewpoint-controlled novel view synthesis. This technique can generalize to complex scenes without needing large 3D training datasets.
- The mutual information guidance improves the fidelity of the generated images over the pseudo guidance.
- The method is tested both qualitatively and quantitatively, showcasing superior performance compared to state-of-the-art models, especially in handling back
- While the authors acknowledge the limitation of inference-time optimization in terms of real-time applicability and scalability, a direct runtime comparison with baseline models would provide additional clarity.
- This method is incapable of generating arbitrary camera viewpoints. This might be due to the choice of the 3D prior, Zero123++, which can provide guidance images at six fixed viewpoints. Clarifying this limitation or demonstrating the capability of generating arbitrary viewpoints is
1. The idea of combining priors from a 2D image generative model and an object-level 3D generative model is interesting and can be considered an effective way to leverage different data sources for 3D understanding. I think this idea in general is very important.
2. The NVS results on scene images are quite impressive considering that no additional training is involved.
I think this paper's main weakness is that the authors do not present the whole framework in a well-motivated way. I'll briefly state this weakness here and leave the others for the following questions section.
1. The connection between Section 3 and Section 4 is very confusing. First, the detailed experiment setting in Section 3 is not clearly explained (see questions); then the paper suddenly shifts from Section 3, with CLIP embeddings, to Section 4, on leveraging a diffusion model. I suggest authors revise the
1. The paper investigates an approach that sequentially optimizes the text embeddings and LoRA layers of a diffusion model to guide the novel view synthesis process.
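The sequential optimization described in this review can be illustrated with a toy example. The sketch below is not the authors' implementation: a small fixed matrix `W` stands in for the frozen diffusion model, the vector `e` for the text embedding, the low-rank factors `A` and `B` for LoRA layers, and `target` for whatever guidance signal the method uses. All names, sizes, learning rates, and the squared-error loss are assumptions chosen only to show the two-stage pattern (first optimize the embedding with the weights frozen, then optimize the low-rank adapter with the embedding frozen).

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def residual(W, A, B, e, target):
    """Residual of the toy model y = (W + A B) e against the target."""
    Be = matvec(B, e)  # low-rank bottleneck activation (length r)
    y = [wy + ay for wy, ay in zip(matvec(W, e), matvec(A, Be))]
    return [yi - ti for yi, ti in zip(y, target)]

def loss(W, A, B, e, target):
    """Squared error of the toy model (a stand-in loss, not the paper's)."""
    return sum(r * r for r in residual(W, A, B, e, target))

def optimize_embedding(W, A, B, e, target, lr=0.1, steps=200):
    """Stage 1: update only the embedding e; W, A, and B stay frozen."""
    e = list(e)
    m, d, r = len(W), len(e), len(B)
    for _ in range(steps):
        res = residual(W, A, B, e, target)
        grad = []
        for j in range(d):
            # dL/de_j = 2 * sum_i (W + A B)[i][j] * res[i]
            g = 0.0
            for i in range(m):
                w_eff = W[i][j] + sum(A[i][k] * B[k][j] for k in range(r))
                g += 2.0 * w_eff * res[i]
            grad.append(g)
        e = [ej - lr * gj for ej, gj in zip(e, grad)]
    return e

def optimize_adapter(W, A, B, e, target, lr=0.05, steps=500):
    """Stage 2: update only the low-rank factors A and B; W and e stay frozen."""
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    m, d, r = len(W), len(e), len(B)
    for _ in range(steps):
        res = residual(W, A, B, e, target)
        Be = matvec(B, e)
        At_res = [sum(A[i][k] * res[i] for i in range(m)) for k in range(r)]
        for i in range(m):
            for k in range(r):
                A[i][k] -= lr * 2.0 * res[i] * Be[k]   # dL/dA = 2 res (B e)^T
        for k in range(r):
            for j in range(d):
                B[k][j] -= lr * 2.0 * At_res[k] * e[j]  # dL/dB = 2 (A^T res) e^T
    return A, B

# Hypothetical toy problem: 3 outputs, 2-dimensional embedding, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # frozen "pretrained" weights
target = [1.0, 2.0, 0.0]                    # stand-in guidance signal
e0 = [0.0, 0.0]                             # initial embedding
A0 = [[0.0], [0.0], [0.0]]                  # zero init: adapter starts as a no-op
B0 = [[0.5, 0.5]]

loss_initial = loss(W, A0, B0, e0, target)
e1 = optimize_embedding(W, A0, B0, e0, target)
loss_stage1 = loss(W, A0, B0, e1, target)
A1, B1 = optimize_adapter(W, A0, B0, e1, target)
loss_stage2 = loss(W, A1, B1, e1, target)

print("initial:", loss_initial)
print("after embedding stage:", loss_stage1)
print("after adapter stage:", loss_stage2)
```

In this toy setting the embedding alone cannot fit the target, because the frozen weights map two embedding dimensions to three outputs; the adapter stage then closes the remaining gap. That is only meant to illustrate why one might optimize the two parameter groups in sequence, and says nothing about the paper's actual losses or schedules.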
1. The proposed method relies on a pre-trained multi-view diffusion model, Zero123++, and thus shares its limitations: restricted view generation (Zero123++ generates six fixed-view images around the object), modification of background and object, handling multiple objects, and pose misalignment.
2. The evaluation can seem unfair. For an apples-to-apples comparison, the paper could compare with running the proposed method on Zero123++. Also, measuring the CLIP score seems unfair as the proposed metho
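For readers unfamiliar with the CLIP score mentioned in this review, a CLIP-style score is commonly computed as the cosine similarity between an image embedding and a text embedding, often clamped at zero. The sketch below shows only that arithmetic on made-up 3-dimensional vectors. How the paper computes its CLIP score is not stated in this excerpt, and the function names and toy embeddings here are hypothetical; real embeddings would come from a CLIP encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)

def clip_style_score(image_embedding, text_embedding):
    """Cosine similarity clamped at zero (one common convention; an assumption here)."""
    return max(cosine_similarity(image_embedding, text_embedding), 0.0)

# Toy embeddings, invented for illustration only.
image_emb = [3.0, 4.0, 0.0]
text_aligned = [3.0, 4.0, 0.0]      # same direction as the image embedding
text_orthogonal = [-4.0, 3.0, 0.0]  # perpendicular to it
text_opposite = [-3.0, -4.0, 0.0]   # opposite direction

score_aligned = clip_style_score(image_emb, text_aligned)
score_orthogonal = clip_style_score(image_emb, text_orthogonal)
score_opposite = clip_style_score(image_emb, text_opposite)
```

Because the score depends only on how well the image embedding aligns with the text embedding, a method that directly optimizes CLIP-space quantities could plausibly be favored by it, which is one way to read the fairness concern raised above.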
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Optical Imaging Technologies
Methods: Contrastive Language-Image Pre-training · Diffusion
