CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang; Dexiang Hong; Yitong Wang; Jie Shao; Xinglong Wu; Zuxuan Wu; Yu-Gang Jiang

arXiv:2412.03859·cs.CV·August 7, 2025

TL;DR

CreatiLayout introduces a multimodal diffusion transformer with a siamese architecture and a large-scale dataset for precise, controllable, and creative layout-to-image generation, leveraging layout planning and optimization.

Contribution

The paper presents SiamLayout, a novel multimodal diffusion transformer with a siamese structure for layout guidance, and introduces LayoutSAM dataset and Layout Designer for enhanced layout-to-image generation.

Findings

01. Effective integration of layout guidance into MM-DiT.

02. Large-scale LayoutSAM dataset with 2.7 million image-text pairs.

03. Improved quality and controllability in layout-to-image generation.

Abstract

Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the…
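
The abstract describes a layout as region-specific positions paired with descriptions. As a rough illustration of what such an input could look like, here is a minimal Python sketch. The class names (`Region`, `Layout`), the field names, and the normalized (x0, y0, x1, y1) box convention are all assumptions made for this example; they are not taken from the paper or from the released CreatiLayout code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    """One layout entity: where it goes and what it should depict."""
    description: str
    box: Box


@dataclass
class Layout:
    """A global caption plus a list of region-level constraints."""
    global_caption: str
    regions: List[Region]


def validate(layout: Layout) -> None:
    """Raise ValueError if any box is empty or falls outside the unit square."""
    for region in layout.regions:
        x0, y0, x1, y1 = region.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box for {region.description!r}: {region.box}")


def to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to pixel coordinates for a given image size."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


# Illustrative layout; the caption and regions are made up for this example.
layout = Layout(
    global_caption="a cat sitting next to a lamp on a wooden desk",
    regions=[
        Region("a fluffy orange cat", (0.10, 0.40, 0.55, 0.95)),
        Region("a brass desk lamp", (0.60, 0.15, 0.90, 0.80)),
    ],
)
validate(layout)
pixel_boxes = [to_pixels(r.box, 1024, 1024) for r in layout.regions]
```

Keeping boxes normalized makes the same layout reusable across output resolutions; a layout-conditioned generator would then consume the caption, the region descriptions, and the boxes as separate conditioning inputs.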

Code & Models

Models

HuiZhang0812/CreatiLayout (Hugging Face model)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Simulation and Modeling Applications

Methods: Sparse Evolutionary Training · Diffusion · Focus