Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

Zeyi Sun; Tong Wu; Pan Zhang; Yuhang Zang; Xiaoyi Dong; Yuanjun Xiong; Dahua Lin; Jiaqi Wang

arXiv:2406.00093·cs.CV·October 4, 2024


Open Access · 3 Models · 1 Dataset · 4 Reviews

TL;DR

Bootstrap3D introduces a synthetic data generation framework that enhances multi-view diffusion models for 3D content creation by producing high-quality multi-view images with captions, improving image quality and prompt-following.

Contribution

The paper presents a novel pipeline combining diffusion models and a 3D-aware filtering model to generate large-scale high-quality multi-view data for training diffusion models.

Findings

01

Generated 1 million synthetic multi-view images with captions.

02

Achieved improved image quality and view consistency.

03

Enhanced prompt-following ability in 3D diffusion models.

Abstract

Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive…
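The two-stage pipeline described in the abstract (generate multi-view images from text prompts, then filter and re-caption them) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `bootstrap_dataset`, `MultiViewSample`, the toy generator and scorer, and the `min_quality` threshold are all hypothetical names and assumptions, and strings stand in for real images.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class MultiViewSample:
    """One kept training sample (hypothetical structure, not from the paper)."""
    prompt: str
    views: List[str]   # placeholders standing in for the multi-view images
    caption: str       # caption after rewriting
    quality: float     # score assigned by the filtering model


def bootstrap_dataset(
    prompts: Sequence[str],
    generate_views: Callable[[str], List[str]],
    score_and_caption: Callable[[List[str], str], Tuple[float, str]],
    min_quality: float,
) -> List[MultiViewSample]:
    """Generate views per prompt, then keep and re-caption only the good ones.

    `generate_views` stands in for the 2D/video diffusion stage and
    `score_and_caption` for the MV-LLaVA stage. Filtering by a scalar
    threshold is an assumption made for this sketch.
    """
    kept: List[MultiViewSample] = []
    for prompt in prompts:
        views = generate_views(prompt)
        quality, caption = score_and_caption(views, prompt)
        if quality < min_quality:
            continue  # discard low-quality or inconsistent samples
        kept.append(MultiViewSample(prompt, views, caption, quality))
    return kept


def toy_generate_views(prompt: str) -> List[str]:
    """Toy stand-in for a generator: returns four labelled view placeholders."""
    return [f"{prompt}@view{i}" for i in range(4)]


def toy_score_and_caption(views: List[str], prompt: str) -> Tuple[float, str]:
    """Toy stand-in for a scorer: quality is simply the prompt's word count."""
    return float(len(prompt.split())), f"rewritten: {prompt}"


if __name__ == "__main__":
    data = bootstrap_dataset(
        ["a red chair", "x"], toy_generate_views, toy_score_and_caption, 3.0
    )
    for sample in data:
        print(sample.prompt, sample.quality, sample.caption)
```

Passing the generator and the scorer as callables keeps the loop independent of any particular model, so either stage can be swapped without touching the filtering logic.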

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The motivation is clear: compared with single-image models, the current limitation of multi-view models is the data. Using an existing text-to-multi-view model to generate more data and then training on it makes sense.
2. The way the multi-view data is filtered is important. Because synthetic data might not be multi-view consistent, obtaining a good subset matters. The idea of training an MV-LLaVA is interesting.
3. The use of synthetic data is also important. Even after filtering, the data is still…

Weaknesses

1. Lack of multi-view consistency evaluation. Although there are results using LRM and similar models, view consistency is not measured directly. One option is to use LRM to obtain a 3D asset and then compute the reprojection error.
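As a rough illustration of the reprojection error the reviewer suggests, the sketch below projects 3D points through a simple pinhole camera and averages the pixel distance to the observed 2D locations. It is one possible geometric reading of the suggestion, not a metric from the paper: the function names, the single-focal-length camera model, and the assumption that matched 3D points and 2D observations are already available are all hypothetical. A photometric variant would instead compare rendered and generated pixels.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def project(point: Vec3, rotation: Mat3, translation: Vec3,
            focal: float, principal: Vec2) -> Vec2:
    """Project a world-space point into pixel coordinates (pinhole model)."""
    # Camera-space coordinates: X_cam = R @ X_world + t
    cam = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    u = focal * cam[0] / cam[2] + principal[0]
    v = focal * cam[1] / cam[2] + principal[1]
    return (u, v)


def mean_reprojection_error(points: Sequence[Vec3],
                            observations: Sequence[Vec2],
                            rotation: Mat3, translation: Vec3,
                            focal: float, principal: Vec2) -> float:
    """Average pixel distance between projected points and observations."""
    if not points or len(points) != len(observations):
        raise ValueError("need equally many points and observations")
    errors = [
        math.dist(project(p, rotation, translation, focal, principal), obs)
        for p, obs in zip(points, observations)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    pts = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
    obs = [(50.0, 50.0), (103.0, 54.0)]
    print(mean_reprojection_error(pts, obs, identity, (0.0, 0.0, 0.0),
                                  100.0, (50.0, 50.0)))
```

A lower average across the generated views would indicate that they agree better with a single underlying 3D shape.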

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. The idea of bootstrapping data from existing multi-view generation models to train a new multi-view generation model is novel and interesting. This is similar to the pseudo-labeling technique in semi-supervised learning.
2. The qualitative results shown in the paper are good compared to other existing methods.
3. The presentation of the pipeline is detailed and clear, with helpful illustrations and diagrams.

Weaknesses

1. In Section 4 the paper compares the proposed method to existing methods like Instant3D and MVDream, but the diffusion backbones of these methods are different. The proposed method uses PixArt-Alpha, while Instant3D and MVDream use SDXL. It is unclear whether the authors re-implemented Instant3D and MVDream with the PixArt-Alpha backbone. The comparison is unfair if different backbones are used, because we do not know whether the performance gain is due to a better training scheme or just a more powerfu…

Reviewer 03 · Rating 3 · Confidence 5

Strengths

1. Motivation is good. The paper finds that current 3D datasets are low quality and investigates how to generate high-quality datasets for subsequent applications.
2. Writing is good. The paper is well written and easy to follow.
3. High-volume dataset. The paper generates 1 million multi-view images with dense descriptive captions suitable for training the multi-view diffusion model.

Weaknesses

1. Purpose ambiguity. I understand that the authors want to generate high-quality multi-view images. However, in line 016, the authors claim that "they propose Bootstrap3D to generate an arbitrary quantity of multi-view images to assist in training multi-view diffusion models". This is ambiguous and I am confused. If the proposed Bootstrap3D can generate multi-view images, why not use it to directly generate an arbitrary number of views or reconstruct 3D models while training a m…

Reviewer 04 · Rating 3 · Confidence 4

Strengths

* The proposed data generation pipeline is practical and addresses significant challenges in the field of 3D generative models.
* The paper clearly describes the challenges of building a data generation pipeline and provides well-documented implementation details that address these challenges.

Weaknesses

Weakness 1: Clarification about contribution. The paper explains the data generation pipeline well; however, additional components such as TTR and MV-LLaVA make it difficult to fully understand the paper's contribution. Specifically, automated data generation can improve all Text-to-Image-to-Multi-View and Text-to-Multi-View diffusion models when combined with TTR. However, both zero123++ and MVDream, which are used in the comparison, were fine-tuned from a variant of Stable Diffusion, so I do…

Code & Models

Models

Datasets

Zery/BS-Objaverse
dataset · 27 dl


Taxonomy

Topics: Image Processing and 3D Reconstruction · Human Motion and Animation · 3D Shape Modeling and Analysis

Methods: Diffusion