Multimodal Large Language Models for Multi-Subject In-Context Image Generation

Yucheng Zhou; Dubing Chen; Huan Zheng; Jianbing Shen

arXiv:2604.07422·cs.LG·April 10, 2026

TL;DR

MUSIC is a novel multimodal large language model designed for multi-subject in-context image generation, addressing subject missing and semantic drift with a scalable data pipeline, reasoning mechanisms, and a new benchmark.

Contribution

The paper introduces MUSIC, which the authors describe as the first MLLM tailored for multi-subject in-context image generation. It combines an automatic data generation pipeline, vision chain-of-thought reasoning, and layout planning.

Findings

01. MUSIC outperforms existing methods in both multi-subject and single-subject scenarios.

02. The model effectively manages complex subject images and semantic relationships.

03. A new benchmark, MSIC, evaluates multi-subject in-context generation capabilities.

Abstract

Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities increases, existing methods often suffer from subject missing and semantic drift. To address this problem, we propose MUSIC, the first MLLM specifically designed for MUlti-Subject In-Context image generation. To overcome data scarcity, we introduce an automatic and scalable data generation pipeline that eliminates the need for manual annotation. Furthermore, we enhance the model's understanding of multi-subject semantic relationships through a vision chain-of-thought (CoT) mechanism, guiding step-by-step reasoning from subject images to semantics and generation. To mitigate identity entanglement and manage visual…
