MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion

Sen Li; Ruochen Wang; Cho-Jui Hsieh; Minhao Cheng; Tianyi Zhou

arXiv:2402.12741 · cs.CV · May 27, 2024 · 1 citation

TL;DR

MuLan is a training-free multimodal LLM agent that progressively generates multi-object images with intricate spatial and attribute control, enabling better multi-object image synthesis and human-AI collaboration.

Contribution

MuLan introduces a training-free, multi-step approach that combines an LLM planner with VLM feedback to generate multi-object images one object at a time, giving precise control over spatial layout and attribute binding and allowing human interaction during generation.

Findings

01 Outperforms baselines in multi-object image generation

02 Enables interactive human-in-the-loop editing

03 Demonstrates superior creativity and control

Abstract

Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. To efficiently address these challenges, we develop a training-free Multimodal-LLM agent (MuLan) that, like a human painter, can progressively generate multi-object images with intricate planning and feedback control. MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object by Stable Diffusion, conditioned on previously generated objects. Unlike existing LLM-grounded methods, MuLan only produces a high-level plan at the beginning, while the exact size and location of each object are determined upon each sub-task by an LLM and attention guidance. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback to the image generated in…
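The control flow described in the abstract (plan once, then generate one object per sub-task with a feedback check) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `Canvas`, `progressive_generate`, `plan`, `generate_object`, `check`, and `max_retries` are hypothetical names, the three callables stand in for the LLM planner, the diffusion step with attention guidance, and the VLM feedback respectively, and the toy stubs at the bottom only manipulate strings so the loop can be run without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Canvas:
    """Hypothetical stand-in for the image so far: the objects placed on it."""
    objects: List[str] = field(default_factory=list)


def progressive_generate(
    prompt: str,
    plan: Callable[[str], List[str]],                 # assumed: LLM planner
    generate_object: Callable[[Canvas, str], Canvas],  # assumed: one diffusion sub-task
    check: Callable[[Canvas, str], bool],              # assumed: VLM feedback
    max_retries: int = 2,                              # assumed retry budget
) -> Canvas:
    """Add one object per sub-task, conditioning on what is already drawn."""
    canvas = Canvas()
    for obj in plan(prompt):  # high-level plan is produced once, up front
        for _ in range(max_retries + 1):
            # Size and location would be decided here, per sub-task.
            candidate = generate_object(canvas, obj)
            if check(candidate, obj):
                break
        # If every attempt fails the check, the last candidate is kept.
        canvas = candidate
    return canvas


# Toy stubs so the sketch runs without any model.
def toy_plan(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(" and ")]


def toy_generate(canvas: Canvas, obj: str) -> Canvas:
    return Canvas(canvas.objects + [obj])


def toy_check(canvas: Canvas, obj: str) -> bool:
    return obj in canvas.objects


result = progressive_generate(
    "a red cube and a blue ball", toy_plan, toy_generate, toy_check
)
print(result.objects)  # ['a red cube', 'a blue ball']
```

Passing the three stages in as callables keeps the loop independent of any particular LLM, diffusion model, or VLM; in a real system each stub would be replaced by a model call.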

Code & Models

Repositories

measure-infinity/mulan-code (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Diffusion