ConsistCompose: Unified Multimodal Layout Control for Image Composition

Xuanke Shi; Boxuan Li; Xiaoyang Han; Zhongang Cai; Lei Yang; Quan Wang; Dahua Lin

arXiv:2511.18333 · cs.CV · March 17, 2026

TL;DR

ConsistCompose introduces a unified multimodal framework embedding layout coordinates into language prompts, enabling precise, layout-controlled multi-instance image generation and outperforming existing methods in spatial accuracy.

Contribution

It presents a novel unified framework and dataset for layout-controlled multimodal image generation, enabling precise spatial control without task-specific modules.

Findings

01. Improves spatial accuracy over baselines
02. Preserves identity fidelity in generated images
03. Demonstrates competitive multimodal understanding

Abstract

Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, which limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text inputs within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this…
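
To make the idea of embedding layout coordinates in a language prompt concrete, here is a minimal sketch in Python. It is an illustration only: the `<bbox>` tag, the normalized `[x_min, y_min, x_max, y_max]` ordering, the two-decimal precision, and the names `Instance` and `build_layout_prompt` are all assumptions made for this example and are not taken from the paper or from the released model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    """One object to place: a text description and a pixel-space box."""
    description: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def normalize_box(box, width, height):
    """Scale a pixel-space box to the [0, 1] range on both axes."""
    x0, y0, x1, y1 = box
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def build_layout_prompt(caption: str, instances: List[Instance],
                        width: int, height: int, precision: int = 2) -> str:
    """Append each instance and its normalized box to the caption as plain text.

    The <bbox> tag format is hypothetical; it only illustrates carrying
    coordinates inside the prompt instead of in a separate layout module.
    """
    parts = [caption.strip()]
    for inst in instances:
        coords = normalize_box(inst.box, width, height)
        if not all(0.0 <= c <= 1.0 for c in coords):
            raise ValueError(f"box for {inst.description!r} lies outside the canvas")
        if coords[0] >= coords[2] or coords[1] >= coords[3]:
            raise ValueError(f"box for {inst.description!r} has no positive area")
        coord_text = ", ".join(f"{c:.{precision}f}" for c in coords)
        parts.append(f"{inst.description} <bbox>[{coord_text}]</bbox>")
    return " ".join(parts)


if __name__ == "__main__":
    prompt = build_layout_prompt(
        "A dog and a cat on a sofa.",
        [Instance("a brown dog", (0, 256, 256, 512)),
         Instance("a grey cat", (256, 256, 512, 512))],
        width=512, height=512,
    )
    print(prompt)
```

Because the coordinates travel as ordinary text, the same prompt string could in principle be interleaved with reference images for the image-guided case; how ConsistCompose actually tokenizes coordinates is not specified in the text above.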

Code & Models

Models

🤗
sensenova/ConsistCompose-BAGEL-7B-MoT
model · 3 dl · ♡ 6

Datasets

sensenova/ConsistCompose3M
dataset · 356 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling