WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space

Katja Schwarz; Seung Wook Kim; Jun Gao; Sanja Fidler; Andreas Geiger; Karsten Kreis

arXiv:2311.13570 · cs.CV · April 15, 2024 · 1 citation

Open Access · 3 Reviews

TL;DR

WildFusion introduces a novel 3D-aware image synthesis method using latent diffusion models in view space, eliminating the need for posed images or canonical representations, and achieves high-quality, 3D-consistent results from in-the-wild data.

Contribution

The paper proposes WildFusion, a 3D-aware latent diffusion model trained without multiview supervision, enabling scalable 3D content creation from unposed, in-the-wild images.

Findings

01. Outperforms recent GAN-based methods in 3D consistency and quality.

02. Learns 3D representations without multiview or pose supervision.

03. Enables novel view synthesis from unstructured image datasets.

Abstract

Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images' underlying 3D structure and enables not only reconstruction but also novel…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The paper is nicely presented. Charts and tables are nicely made and I found the paper easy to read through.
- The two-stage training is interesting. Each step of the pipeline looks reasonable to me.
- The training of the method does not require 3D or multi-view image data.

Weaknesses

- An important work is missing from the discussion and comparison: "VQ3D: Learning a 3D-Aware Generative Model on ImageNet", ICCV 2023. The two works are very similar and both adopt a two-stage learning scheme. The major difference is that VQ3D applies a GAN-based method for both stages.
- I am not sensitive to the quantitative numbers in the main paper, but I saw that many NVS results in the supplementary video are distorted. Also, I did not observe a significant visual improvement over EG3D. I wo

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- The authors leverage a latent diffusion model to address the lack of sample diversity in 3D-aware GANs.
- They propose representing 3D-aware images with an efficient triplane representation.
- The training loss avoids the need for multi-view images of the same instance, which makes it easier to train on a much larger amount of data.
- An extensive ablation study supports the design choices.

Weaknesses

- This paper only compares with GAN-based methods. It would be more convincing if a comparison to recent diffusion-based methods (GenVS, IVID, VQ3D) were presented.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- Exposition is excellent, the method is exceedingly clear. The overview figure is great.
- The paper is well-motivated and the shortcomings of prior work are clearly highlighted.
- Design choices are clear.
- Baselines are appropriate.
- Ablations are detailed and insightful.

Weaknesses

My core complaint with this paper is that I am not quite sure why you would use this method over a simple depth-warping plus inpainting baseline. The generated images are of somewhat low quality - they are certainly far behind anything that can be generated with any SOTA 2D generative model. For any generated image, I could always use the same monocular depth predictor used in this paper to estimate depth, and then warp the image to a novel view. The only challenge would then be holes - which,
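The depth-warping baseline this reviewer describes can be sketched roughly as follows. This is a minimal illustration of the reviewer's suggested alternative, not WildFusion's method: the function name `warp_to_novel_view`, the shared pinhole intrinsics `K`, and the relative pose `(R, t)` are all assumptions made for the example, and the depth map is assumed to come from some monocular depth predictor. The returned hole mask marks the pixels that a separate inpainting step (not shown) would have to fill.

```python
import numpy as np

def warp_to_novel_view(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) into a new camera using per-pixel `depth` (H, W).

    K: 3x3 pinhole intrinsics (assumed identical for both views).
    R, t: rotation (3x3) and translation (3,) mapping source-camera
          coordinates to target-camera coordinates.
    Returns (warped image, boolean hole mask of shape (H, W)).
    """
    H, W = depth.shape
    # Homogeneous pixel grid, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)

    # Unproject to 3D in the source frame, then move into the target frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = R @ pts_src + np.asarray(t, dtype=np.float64).reshape(3, 1)

    # Project into the target image and round to the nearest pixel.
    z = pts_tgt[2]
    valid = z > 1e-6
    z_safe = np.where(valid, z, 1.0)
    proj = K @ pts_tgt
    u_t = np.rint(proj[0] / z_safe).astype(np.int64)
    v_t = np.rint(proj[1] / z_safe).astype(np.int64)
    valid &= (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Z-buffer: for each target pixel keep the nearest source pixel.
    src_idx = np.flatnonzero(valid)
    tgt = v_t[src_idx] * W + u_t[src_idx]
    order = np.lexsort((z[src_idx], tgt))  # sort by target pixel, then by depth
    tgt_sorted = tgt[order]
    first = np.ones(order.size, dtype=bool)
    first[1:] = tgt_sorted[1:] != tgt_sorted[:-1]
    winners = src_idx[order[first]]
    targets = tgt_sorted[first]

    colors = image.reshape(H * W, -1)
    warped = np.zeros_like(colors)
    warped[targets] = colors[winners]
    filled = np.zeros(H * W, dtype=bool)
    filled[targets] = True
    return warped.reshape(image.shape), ~filled.reshape(H, W)
```

With an identity pose the function returns the input unchanged and an empty hole mask; any camera motion exposes disoccluded regions in the mask, which is the "holes" problem the reviewer refers to.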

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis

Methods: Diffusion