WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space
Katja Schwarz, Seung Wook Kim, Jun Gao, Sanja Fidler, Andreas Geiger, Karsten Kreis

TL;DR
WildFusion introduces a novel 3D-aware image synthesis method using latent diffusion models in view space, eliminating the need for posed images or canonical representations, and achieves high-quality, 3D-consistent results from in-the-wild data.
Contribution
The paper proposes WildFusion, a 3D-aware latent diffusion model trained without multiview supervision, enabling scalable 3D content creation from unposed, in-the-wild images.
Findings
Outperforms recent GAN-based methods in 3D consistency and quality.
Learns 3D representations without multiview or pose supervision.
Enables novel view synthesis from unstructured image datasets.
Abstract
Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images' underlying 3D structure and enables not only reconstruction but also novel…
Peer Reviews
Decision: ICLR 2024 poster
- The paper is nicely presented. Charts and tables are nicely made and I found the paper easy to read through. - The two-stage training is interesting. Each step of the pipeline looks reasonable to me. - The training of the method does not require 3D or multi-view image data.
- An important work is missing from the discussion/comparison: "VQ3D: Learning a 3D-Aware Generative Model on ImageNet", ICCV 2023. The two works are very similar and both adopt a two-stage learning scheme. The major difference is that VQ3D applies a GAN-based method for both stages. - I am not sensitive to the quantitative numbers in the main paper, but I saw that many NVS results in the supplementary video are distorted. Also, I did not observe a significant visual improvement over EG3D. I wo
- The authors leverage a latent diffusion model to address the lack of sample diversity in 3D-aware GANs. - They propose to represent 3D-aware images with an efficient triplane representation. - The training loss avoids the need for multi-view images of the same instance, which makes it easier to train on a much larger amount of data. - An extensive ablation study supports the design choices.
- 1. This paper only compares against GAN-based methods. It would be more convincing if a comparison to recent diffusion-based methods (GenVS, IVID, VQ3D) were presented.
- Exposition is excellent, the method is exceedingly clear. The overview figure is great. - The paper is well-motivated and the shortcomings of prior work are clearly highlighted. - Design choices are clear. - Baselines are appropriate. - Ablations are detailed and insightful.
My core complaint with this paper is that I am not quite sure why you would use this method over a simple depth-warping plus inpainting baseline. The generated images are of somewhat low quality - they are certainly far behind anything that can be generated with any SOTA 2D generative model. For any generated image, I could always use the same monocular depth predictor used in this paper to estimate depth, and then warp the image to a novel view. The only challenge would then be holes - which,
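The warping step of the baseline this reviewer describes (estimate depth, reproject the image into a new view, then fill the holes) can be sketched roughly as below. This is an illustrative NumPy sketch, not code from the paper or the review. The function name `warp_with_depth`, the pinhole intrinsics `K`, and the relative pose `R`, `t` are all assumptions introduced for the example; the depth map is assumed to come from some monocular predictor that is not shown, and the inpainting step is left out entirely.

```python
import numpy as np


def warp_with_depth(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) into a target view using per-pixel depth.

    Hypothetical helper for illustration only.
    K    : 3x3 pinhole intrinsics, assumed identical for both views.
    R, t : rotation (3x3) and translation (3,) mapping source-camera
           coordinates to target-camera coordinates.
    Returns the warped image and a boolean (H, W) mask of holes, i.e. target
    pixels that received no source pixel and would need inpainting.
    """
    H, W = depth.shape
    C = image.shape[2]

    # Homogeneous pixel coordinates (u, v, 1) in row-major order.
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject to 3D points in the source camera, then move to the target camera.
    pts_src = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    pts_tgt = pts_src @ R.T + t

    # Project into the target image and round to the nearest pixel.
    z = pts_tgt[:, 2]
    in_front = z > 1e-6
    proj = pts_tgt @ K.T
    z_safe = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth
    u_t = np.rint(proj[:, 0] / z_safe).astype(np.int64)
    v_t = np.rint(proj[:, 1] / z_safe).astype(np.int64)
    valid = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Z-buffer: keep only the nearest source point for each target pixel.
    idx = np.flatnonzero(valid)
    target = v_t[idx] * W + u_t[idx]
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, target, z[idx])
    nearest = z[idx] == zbuf[target]  # exact depth ties are resolved arbitrarily

    warped = np.zeros((H * W, C), dtype=image.dtype)
    warped[target[nearest]] = image.reshape(H * W, C)[idx[nearest]]
    holes = np.isinf(zbuf).reshape(H, W)
    return warped.reshape(H, W, C), holes
```

With a constant depth of 2, a focal length of 50 pixels, and a sideways translation of 0.2, every pixel shifts by 50 · 0.2 / 2 = 5 columns, and the five columns left empty are reported in the hole mask. Nearest-pixel splatting is the simplest choice here; it makes disocclusions explicit but can also leave small cracks under rotation or strong depth variation.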
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
Methods: Diffusion
