Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
Divyansh Srivastava, Xiang Zhang, He Wen, Chenru Wen, Zhuowen Tu

TL;DR
Lay-Your-Scene introduces a new open-vocabulary scene layout generation method using diffusion transformers, outperforming existing models and enabling applications like image editing and improved scene initialization.
Contribution
It proposes a pipeline that pairs lightweight open-source language models, used to extract scene elements from the text prompt, with a novel aspect-aware diffusion Transformer for controllable, open-vocabulary scene layout generation.
Findings
Outperforms existing scene layout methods on spatial and numerical reasoning benchmarks.
Achieves state-of-the-art performance in open-vocabulary layout generation.
Demonstrates effective integration with large language models for initialization and image editing.
Abstract
We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models…
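To make the notion of a scene layout concrete, here is a minimal sketch of one common way to represent it: each scene element is a label plus a bounding box in normalized coordinates, which can then be scaled to a canvas of any aspect ratio. The `Box` class, the `to_pixels` helper, and the example prompt and coordinates are all hypothetical and chosen for illustration; the listing above does not describe the representation LayouSyn actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One scene element: a label plus a box in normalized [0, 1] coordinates."""
    label: str
    cx: float  # box center x, as a fraction of canvas width
    cy: float  # box center y, as a fraction of canvas height
    w: float   # box width, as a fraction of canvas width
    h: float   # box height, as a fraction of canvas height


def _clamp(value: float, upper: int) -> int:
    """Round to the nearest pixel and keep the result inside [0, upper]."""
    return max(0, min(upper, round(value)))


def to_pixels(box: Box, width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a normalized box to (left, top, right, bottom) in pixels."""
    left = (box.cx - box.w / 2) * width
    top = (box.cy - box.h / 2) * height
    right = (box.cx + box.w / 2) * width
    bottom = (box.cy + box.h / 2) * height
    return (_clamp(left, width), _clamp(top, height),
            _clamp(right, width), _clamp(bottom, height))


# Hypothetical layout for the prompt "a dog to the left of a bench".
# These coordinates are made up for illustration, not model output.
layout = [
    Box("dog", cx=0.25, cy=0.70, w=0.30, h=0.40),
    Box("bench", cx=0.70, cy=0.65, w=0.40, h=0.30),
]

# The same normalized layout placed on two canvases with different aspect ratios.
for width, height in [(512, 512), (768, 512)]:
    print((width, height), [(b.label, to_pixels(b, width, height)) for b in layout])
```

Keeping boxes normalized separates the layout from the output resolution, so a spatial relation such as "left of" can be checked by comparing box edges on any canvas size.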
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis
Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Diffusion · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding
