MVDream: Multi-view Diffusion for 3D Generation

Yichun Shi; Peng Wang; Jianglong Ye; Mai Long; Kejie Li; Xiao Yang

arXiv:2308.16512·cs.CV·April 19, 2024·72 cites

TL;DR

MVDream is a multi-view diffusion model that generates consistent multi-view images from text prompts, combining 2D and 3D learning to improve 3D generation and concept learning.

Contribution

It introduces a novel multi-view diffusion model that unifies 2D and 3D data, enabling improved 3D generation and concept learning from limited examples.

Findings

01. Achieves high consistency in multi-view image generation.

02. Enhances 3D generation stability via Score Distillation Sampling.

03. Learns new concepts from few 2D examples.

Abstract

We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.
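As a rough illustration of how Score Distillation Sampling (SDS) uses a diffusion model as a prior, the sketch below computes the commonly cited SDS gradient form, w(t) * (predicted noise - injected noise), on a batch of rendered views. It is not the authors' implementation. The "renderer" is assumed to be the identity (the views themselves are the parameters), and `make_toy_predictor` is a hypothetical stand-in for the multi-view diffusion model: it is the ideal noise predictor for a distribution concentrated on a single known `target`. All function and parameter names here are illustrative assumptions.

```python
import numpy as np


def sds_grad(x, predict_noise, alpha_bar_t, rng, weight=1.0):
    """One-sample SDS gradient estimate with respect to the rendered views x.

    x             : rendered views, shape (n_views, H, W, C)
    predict_noise : callable (x_t, alpha_bar_t) -> predicted noise (assumed interface)
    alpha_bar_t   : cumulative noise-schedule term in (0, 1)
    """
    eps = rng.standard_normal(x.shape)  # noise injected into the renders
    x_t = np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps
    eps_hat = predict_noise(x_t, alpha_bar_t)  # the model's guess of that noise
    return weight * (eps_hat - eps)


def make_toy_predictor(target):
    """Hypothetical stand-in for a diffusion model.

    Returns the exact noise predictor for data that always equals `target`.
    """
    def predict_noise(x_t, alpha_bar_t):
        return (x_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)
    return predict_noise


def optimize_views(target, steps=200, lr=0.1, alpha_bar_t=0.5, seed=0):
    """Gradient descent on the views using SDS gradients from the toy predictor."""
    rng = np.random.default_rng(seed)
    x = np.zeros_like(target)  # start from blank views
    predictor = make_toy_predictor(target)
    for _ in range(steps):
        x = x - lr * sds_grad(x, predictor, alpha_bar_t, rng)
    return x


if __name__ == "__main__":
    target = np.ones((4, 2, 2, 3))  # 4 tiny "views" of 2x2 RGB pixels
    views = optimize_views(target)
    print(float(np.abs(views - target).max()))
```

In this toy setup the injected noise cancels and each step pulls the views toward `target`. In a real pipeline the gradient would instead be backpropagated through a differentiable renderer into the parameters of the 3D representation, with the predictor conditioned on the text prompt and camera poses.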

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

The paper is well-written and easy to understand. The paper focuses on the consistency problem in using 2D diffusion models for 3D generation, which is neglected in previous methods. The proposed method is among the first to adopt the multi-view image generation setting instead of generating views separately. The experimental results seem promising.

Weaknesses

1. How to ensure consistency across multi-view images? There is no particular 3D prior learned in the proposed method.
2. What does re-using 2D self-attention mean? Does it mean inheriting the weights from a pre-trained model or just adopting the same architecture?
3. The description of the training set should be included in the main paper rather than the appendix.
4. Regarding multi-view consistency, it is also important to train the model only on the 3D dataset to see the effect of the uti

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper proposes a multi-view diffusion model. To reuse existing large 2D image datasets, the model is designed to accept both single images and multi-view images as input.
1. The advantage of using both 3D and 2D data is also ablated in the appendix. This is an important observation. It also aligns with our conjecture: even with Objaverse, the scale of 3D data is still limited. Using 2D datasets has a significant positive effect when training 3D networks.
2. The trained model will b

Weaknesses

1. Did the authors consider rendering the images with large elevation? For example, what if we have camera poses at the top or bottom of the object? Will this give more information to score distillation?
2. The main comparison in Fig. 6 is not fair. All other methods are optimization-based by reusing 2D image diffusion. However, the proposed method used additional datasets. A fair comparison would be zero123+SDS?
3. Following the above comment, the trained model should be able to combine with SJ

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

The method shows a distinctive improvement over the state of the art.
- It clearly reduces the multi-face Janus problem. In several shown examples, it eliminates this problem.
- It is selected as better than previous work at an extremely high rate: 78% vs. 4 other methods, combined.
- Objects are clearly consistent across 360-degree views, and do not suffer from content drift.
- Exhaustive results upon the project webpage make these conclusions clear.
Utilizing 3D assets to train a text-to-3D mode

Weaknesses

Summary: the qualitative performance of the methods in this paper is impressive, but the experimental presentation in writing is poor. I think many changes are needed, but the results are quite impressive so I am still positive about this paper.
Experimental presentation is poor
- What is the point of each experiment? There is no overview section to clarify this. After multiple reads, I think I can infer, but reading was challenging.
- It is often very hard to figure out the point of an experi

Code & Models

Videos

MVDream: Multi-view Diffusion for 3D Generation· slideslive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Image Retrieval and Classification Techniques

Methods: Diffusion