Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
Sida Huang, Siqi Huang, Ping Luo, Hongyuan Zhang

TL;DR
Laytrol is a layout control network for layout-to-image generation that preserves the pretrained knowledge of the base diffusion model, using a new dataset (LaySyn) and specialized initialization schemes.
Contribution
We propose Laytrol, a novel layout control network that maintains pretrained knowledge in diffusion models, along with the LaySyn dataset to reduce distribution shift.
Findings
Laytrol improves image quality and layout accuracy.
The method preserves pretrained knowledge effectively.
Experiments show superior performance over existing methods.
Abstract
With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT…
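To make "spatially consistent with the given layout condition" concrete, the sketch below represents a layout as labeled, normalized bounding boxes and scores how many of them are matched by boxes detected in a generated image. This is only an illustration: the `Box` type, the `layout_consistency` function, and the 0.5 IoU threshold are assumptions introduced here, not the metric or data format used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical layout entry: a label plus a normalized (x0, y0, x1, y1) box."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a.x0, b.x0), max(a.y0, b.y0)
    ix1, iy1 = min(a.x1, b.x1), min(a.y1, b.y1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def layout_consistency(layout: List[Box], detections: List[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of layout boxes that have a same-label detection with IoU >= threshold.

    Simplification: a detection may satisfy more than one layout box
    (no one-to-one matching is enforced).
    """
    if not layout:
        return 1.0
    hits = 0
    for target in layout:
        best = max((iou(target, d) for d in detections if d.label == target.label),
                   default=0.0)
        if best >= threshold:
            hits += 1
    return hits / len(layout)


# Requested layout: a cat in the upper left, a dog in the lower right.
layout = [Box("cat", 0.1, 0.1, 0.5, 0.5), Box("dog", 0.5, 0.5, 0.9, 0.9)]
# Boxes a (hypothetical) detector found in the generated image.
detections = [Box("cat", 0.1, 0.1, 0.5, 0.5), Box("dog", 0.0, 0.0, 0.2, 0.2)]
score = layout_consistency(layout, detections)
print(score)
```

Here the cat is placed where requested and the dog is not, so half of the layout is satisfied. A score of this kind measures layout accuracy only; it says nothing about visual quality or stylistic consistency with the base model, which the abstract treats as a separate concern.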
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Multimodal Machine Learning Applications
