MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Yanbo Ding; Shaobin Zhuang; Kunchang Li; Zhengrong Yue; Yu Qiao; Yali Wang

arXiv:2408.10605·cs.CV·December 17, 2024

TL;DR

MUSES is a multi-modal agent system that generates 3D-controllable images from user queries in three stages: layout planning, 3D object acquisition and calibration, and rendering. It targets prompts with multiple objects and complex 3D spatial relationships.

Contribution

The paper introduces MUSES, a multi-modal agent pipeline for 3D-controllable image generation, and a new benchmark for complex 3D spatial relationships.

Findings

01. MUSES achieves state-of-the-art results on T2I-CompBench and T2I-3DisBench.

02. MUSES outperforms DALL-E 3 and Stable Diffusion 3 in 3D scene generation.

03. The new benchmark provides detailed descriptions of complex 3D spatial relationships.

Abstract

Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES addresses this challenging task by developing a progressive workflow with three key components, including (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing…
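The three-stage workflow named in the abstract can be pictured as a simple function pipeline. The sketch below is only an illustration under stated assumptions, not the authors' implementation: every name in it (`Box2D`, `Object3D`, `layout_manager`, `model_engineer`, `image_artist`, `run_pipeline`, the depth table, and the asset library) is hypothetical, and each stage is a toy stand-in for what the paper describes as an agent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Box2D:
    """Hypothetical 2D layout entry: a named box in normalized image coordinates."""
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass
class Object3D:
    """Hypothetical 3D layout entry: a named object with a position and an asset."""
    name: str
    position: Tuple[float, float, float]  # (x, y, depth)
    asset: str = ""


def layout_manager(boxes: List[Box2D], depths: Dict[str, float]) -> List[Object3D]:
    # Stage 1 (toy stand-in for 2D-to-3D layout lifting):
    # place each object at its box center and attach a supplied depth value.
    return [
        Object3D(b.name, (b.x + b.w / 2, b.y + b.h / 2, depths[b.name]))
        for b in boxes
    ]


def model_engineer(objects: List[Object3D], library: Dict[str, str]) -> List[Object3D]:
    # Stage 2 (toy stand-in for 3D object acquisition):
    # look up an asset path for each object in an assumed local library.
    for obj in objects:
        obj.asset = library.get(obj.name, "<missing>")
    return objects


def image_artist(objects: List[Object3D]) -> List[str]:
    # Stage 3 (toy stand-in for 3D-to-2D rendering):
    # return the draw order, farthest object first (painter's algorithm).
    ordered = sorted(objects, key=lambda o: o.position[2], reverse=True)
    return [o.name for o in ordered]


def run_pipeline(
    boxes: List[Box2D], depths: Dict[str, float], library: Dict[str, str]
) -> List[str]:
    # Chain the three stages in the order given in the abstract.
    return image_artist(model_engineer(layout_manager(boxes, depths), library))


if __name__ == "__main__":
    boxes = [Box2D("cat", 0.1, 0.5, 0.2, 0.2), Box2D("dog", 0.6, 0.4, 0.3, 0.3)]
    depths = {"cat": 2.0, "dog": 5.0}
    library = {"cat": "assets/cat.obj", "dog": "assets/dog.obj"}
    print(run_pipeline(boxes, depths, library))
```

The point of the sketch is the data flow: a 2D plan becomes a 3D plan, the 3D plan is populated with assets, and only then is an image produced, so each intermediate result can be inspected on its own.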

Code & Models

Repositories

DINGYANB/MUSES (PyTorch, official)

Models

yanboding/MUSES (Hugging Face model)

Videos

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Taxonomy

Topics: Modular Robots and Swarm Intelligence · Image Processing and 3D Reconstruction · Robotic Path Planning Algorithms

Methods: Diffusion