Multimodal Large Language Models for Multi-Subject In-Context Image Generation
Yucheng Zhou, Dubing Chen, Huan Zheng, Jianbing Shen

TL;DR
MUSIC is a novel multimodal large language model designed for multi-subject in-context image generation, addressing subject missing and semantic drift with a scalable data pipeline, reasoning mechanisms, and a new benchmark.
Contribution
The paper introduces MUSIC, the first MLLM tailored for multi-subject in-context image generation, with innovative data generation, reasoning, and layout planning techniques.
Findings
MUSIC outperforms existing methods in multi- and single-subject scenarios.
The model effectively manages complex subject images and semantic relationships.
A new benchmark, MSIC, evaluates multi-subject in-context generation capabilities.
Abstract
Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities increases, existing methods often suffer from subject missing and semantic drift. To address this problem, we propose MUSIC, the first MLLM specifically designed for MUlti-Subject In-Context image generation. To overcome the data scarcity, we introduce an automatic and scalable data generation pipeline that eliminates the need for manual annotation. Furthermore, we enhance the model's understanding of multi-subject semantic relationships through a vision chain-of-thought (CoT) mechanism, guiding step-by-step reasoning from subject images to semantics and generation. To mitigate identity entanglement and manage visual…
