Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Xiaolong Li; Jiawei Mo; Ying Wang; Chethan Parameshwara; Xiaohan Fei; Ashwin Swaminathan; CJ Taylor; Zhuowen Tu; Paolo Favaro; Stefano Soatto

arXiv:2404.18065·cs.CV·April 30, 2024

Open Access

TL;DR

This paper introduces Grounded-Dreamer, a two-stage method that improves text-to-3D generation by using multi-view diffusion models, attention refocusing, and hybrid optimization to produce accurate, high-quality, and diverse 3D assets from complex prompts.

Contribution

The paper presents a two-stage approach that improves compositional text-to-3D generation without retraining the multi-view diffusion model or crafting a high-quality compositional 3D dataset.

Findings

01. Outperforms previous state-of-the-art methods in quality and accuracy

02. Enables diverse 3D generation from the same prompt

03. Effectively captures complex, compositional prompts

Abstract

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts and may omit certain subjects or parts entirely. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without needing to retrain the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid…
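
To make the score distillation sampling (SDS) step mentioned in the abstract concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation and does not use MVDream: the "rendered image" is just the parameter vector itself (so the render Jacobian is the identity), the noise level `alpha_bar` is fixed rather than sampled per step, and `make_toy_denoiser` is a hypothetical stand-in for a pretrained diffusion model whose data distribution is a single target point. All function names are illustrative assumptions.

```python
import numpy as np

def sds_gradient(x, predict_noise, alpha_bar, rng, weight=1.0):
    """One-sample SDS gradient estimate with respect to the rendering x.

    Follows the usual SDS form weight * (eps_hat - eps); the denoiser's own
    Jacobian is skipped, as in standard SDS.
    """
    eps = rng.standard_normal(x.shape)                      # sampled noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = predict_noise(x_t, alpha_bar)                 # model's noise estimate
    return weight * (eps_hat - eps)

def make_toy_denoiser(target):
    """Hypothetical denoiser: exact noise predictor when all data equals `target`."""
    def predict_noise(x_t, alpha_bar):
        return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)
    return predict_noise

def optimize(x0, predict_noise, steps=200, lr=0.1, alpha_bar=0.5, seed=0):
    """Plain gradient descent on the parameters using the SDS gradient."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * sds_gradient(x, predict_noise, alpha_bar, rng)
    return x

if __name__ == "__main__":
    target = np.array([1.0, -2.0, 0.5])
    result = optimize(np.zeros(3), make_toy_denoiser(target))
    print(result)  # expected to land close to `target`
```

In this toy setting the update pulls the parameters toward whatever the denoiser considers likely. In a real text-to-3D pipeline the parameters would define a 3D representation, `x` would be views rendered from it, and the gradient would be backpropagated through the renderer.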

Taxonomy

Topics: Image Processing and 3D Reconstruction · 3D Shape Modeling and Analysis · Image Retrieval and Classification Techniques

Methods: Diffusion