MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang

TL;DR
MUSES is a multi-modal agent system for 3D-controllable image generation from user queries. It chains layout planning, 3D object modeling, and rendering stages to produce images containing multiple objects in complex 3D spatial relationships.
Contribution
Introducing MUSES, a novel multi-modal agent pipeline for 3D-controllable image generation, and creating a new benchmark for complex 3D spatial relationships.
Findings
MUSES achieves state-of-the-art results on T2I-CompBench and T2I-3DisBench.
MUSES outperforms DALL-E 3 and Stable Diffusion 3 in 3D scene generation.
The new benchmark provides detailed descriptions of complex 3D spatial relationships.
Abstract
Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES addresses this challenging task by developing a progressive workflow with three key components, including (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing…
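The three-stage workflow named in the abstract can be pictured as a simple function chain. The sketch below is illustrative only, not the authors' implementation: every class, function, and field name (`Box2D`, `Object3D`, `layout_manager`, `model_engineer`, `image_artist`, `run_pipeline`, the `.glb` asset names) is a hypothetical placeholder, and each stage is a stub standing in for what would really be a planning, retrieval, or rendering model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box2D:
    """A 2D layout entry: object name plus a normalized (x, y, w, h) box."""
    name: str
    box: Tuple[float, float, float, float]


@dataclass
class Object3D:
    """A 3D layout entry: a 2D box lifted with a depth and an orientation."""
    name: str
    box: Tuple[float, float, float, float]
    depth: float
    orientation: str
    asset: str = ""  # filled in by the model-engineer stage


def layout_manager(objects: List[str]) -> List[Object3D]:
    """Stage 1 (stub): plan a 2D layout, then lift it to a 3D layout."""
    n = len(objects)
    layout_2d = [Box2D(name, (i / n, 0.5, 1 / n, 0.3))
                 for i, name in enumerate(objects)]
    # Placeholder lifting: increasing depth and a default orientation.
    return [Object3D(b.name, b.box, depth=1.0 + i, orientation="front")
            for i, b in enumerate(layout_2d)]


def model_engineer(layout: List[Object3D]) -> List[Object3D]:
    """Stage 2 (stub): attach a 3D asset to every object in the layout."""
    for obj in layout:
        obj.asset = f"{obj.name}.glb"  # hypothetical asset file name
    return layout


def image_artist(scene: List[Object3D]) -> str:
    """Stage 3 (stub): 'render' the scene; here, a far-to-near text summary."""
    ordered = sorted(scene, key=lambda o: o.depth, reverse=True)
    return "; ".join(f"{o.name}@depth={o.depth:.1f} facing {o.orientation}"
                     for o in ordered)


def run_pipeline(objects: List[str]) -> str:
    """Top-down planning (layout) followed by bottom-up generation."""
    return image_artist(model_engineer(layout_manager(objects)))
```

Calling `run_pipeline(["cat", "dog"])` passes the object list through all three stubs in order. The point of the sketch is only the data flow: each stage consumes the previous stage's structured output, which is what makes the intermediate 3D layout inspectable.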
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Image Processing and 3D Reconstruction · Robotic Path Planning Algorithms
Methods: Diffusion
