HiFi-123: Towards High-fidelity One Image to 3D Content Generation
Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Wenbo Hu, Long Quan, Ying Shan, Yonghong Tian

TL;DR
HiFi-123 advances 3D content generation from a single image by enhancing fidelity and multi-view consistency through novel reference-guided techniques, achieving state-of-the-art results in zero-shot view synthesis.
Contribution
The paper introduces RGNV and RGSD, two novel reference-guided techniques that significantly improve 3D generation quality from a single image.
Findings
Enhanced fidelity in 3D generation results
Improved multi-view consistency
Achieved state-of-the-art performance
Abstract
Recent advances in diffusion models have enabled 3D generation from a single image. However, current methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image, limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods. Second, capitalizing on the RGNV, we present a novel Reference-Guided State Distillation (RGSD) loss. When incorporated into the optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality, achieving state-of-the-art performance. Comprehensive evaluations demonstrate the…
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The method seems effective. Both qualitative and quantitative experiments are conducted to demonstrate its effectiveness.
2. The presentation of the method is clear and easy to understand.
3. The paper is mostly self-contained. Relevant prior works are cited and necessary preliminary concepts are introduced.
The quantitative experiments are relatively weak for several reasons:
a. Lack of evaluation of 3D generated geometry: the proposed method is claimed to enhance the fidelity of 3D generation. However, there's no quantitative evaluation on 3D geometry or texture. There are plenty of datasets such as GSO, RTMV, CO3D where such evaluation can be done in a standardized manner. I think this will be necessary in showing the effectiveness of the approach.
b. Lack of evaluation of 3D consistency: one o
* The use of DDIM inversion + attention injection, in the context of image-to-3D, is not only quite novel, but also performs very well -- its effects are ablated in the appendix.
* The proposed method produces 3D shapes that are significantly more consistent with the reference view while having better visual quality when viewed from unseen directions. This can be seen both numerically from the overall better evaluation metrics, as well as empirically from visual results.
* The method does not
* The key technique used in the reference-guided novel view enhancement method proposed in the paper is not completely new. It has been incorporated in diffusion-based video generation [Wu et al. 2022, Qi et al. 2023] and image editing [Cao et al. 2023].
* The majority of the contributions of the paper focus on the "fine" stage, while the "coarse" stage still relies on SDS and reconstruction loss. The fine stage will likely not be able to recover from the mistakes in the coarse stage. It will a
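For readers unfamiliar with the attention injection that this review refers to, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes that injection means the target (novel-view) branch keeps its own queries but attends to keys and values taken from the reference-image branch, as in the image-editing work the reviewer cites. The function names `attention` and `injected_attention` are hypothetical, and the DDIM inversion step, the depth conditioning, and the choice of layers and timesteps are all omitted because this page does not describe them.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def injected_attention(target_q, reference_k, reference_v):
    """Hypothetical injection step (an assumption, not the paper's code):
    target queries attend to the reference branch's keys and values, so
    every output token is a weighted average of reference features."""
    return attention(target_q, reference_k, reference_v)


# Toy usage: two target tokens, three reference tokens, feature size 2.
target_q = [[1.0, 0.0], [0.0, 1.0]]
reference_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_v = [[0.0, 10.0], [4.0, 2.0], [2.0, 6.0]]
novel_view_features = injected_attention(target_q, reference_k, reference_v)
```

Because each output row is a convex combination of the reference values, the novel-view features can only draw on content present in the reference branch, which is one plausible reading of why reviewers report closer agreement with the reference image.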
* The paper is well written and easy to follow.
* The proposed method improves the performance of image-to-3D content creation compared to prior methods.
The ablation study is not sufficient:
* The extent to which the improvement is attributable to the depth conditioned stable diffusion model or the attention injection remains unclear;
* The extent to which the novel view enhancement pipeline will be affected by the quality of the rendered coarse view depth map remains unclear;
* Can you also present the result of the generated 3D content after the coarse stage, so that we can see the improvements by the refine stage? Is it in Figure 5, i.e. t
The illustration is generally clear, although some sections could be further improved. Both the visual and quantitative results surpass those of the proposed baselines.
Mistakes: 1) In Fig. 1, is the inconsistency between the depth map and the novel view image in the bottom row a mistake?
Results: 1) More diverse rendering results are expected, such as buildings and human bodies, as in prior works. 2) Flickering issues are observable in the videos. An explanation or analysis is expected.
Evaluations: 1) I doubt the effectiveness of reference view reconstruction for evaluation. One may achieve excellent results on the reference view but l
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
