Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Divyansh Srivastava; Xiang Zhang; He Wen; Chenru Wen; Zhuowen Tu

arXiv:2505.04718·cs.CV·May 9, 2025

TL;DR

Lay-Your-Scene introduces a new open-vocabulary scene layout generation method using diffusion transformers, outperforming existing models and enabling applications like image editing and improved scene initialization.

Contribution

It proposes a lightweight, open-source language model-based pipeline with a novel diffusion Transformer architecture for controllable, open-vocabulary scene layout generation.

Findings

01. Outperforms existing scene layout methods on spatial reasoning benchmarks.

02. Achieves state-of-the-art performance in open-vocabulary layout generation.

03. Demonstrates effective integration with large language models for initialization and image editing.

Abstract

We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models…
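As a rough illustration of what a text-to-layout output could look like, the following Python sketch represents a layout as labelled boxes in normalized coordinates and maps them onto canvases of different aspect ratios. The `Box` class, its centre-format fields, and the `to_pixels` helper are hypothetical names chosen for this example; the abstract does not specify LayouSyn's actual layout representation.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical layout element: an object phrase plus a normalized box."""
    label: str  # open-vocabulary object phrase, e.g. "red umbrella"
    cx: float   # box centre x, as a fraction of canvas width
    cy: float   # box centre y, as a fraction of canvas height
    w: float    # box width, as a fraction of canvas width
    h: float    # box height, as a fraction of canvas height


def to_pixels(box: Box, width: int, height: int) -> tuple[int, int, int, int]:
    """Map a normalized centre-format box to pixel corners (x0, y0, x1, y1).

    Corners are clipped to the canvas so a box that overshoots the frame
    still yields valid pixel coordinates.
    """
    x0 = (box.cx - box.w / 2) * width
    y0 = (box.cy - box.h / 2) * height
    x1 = (box.cx + box.w / 2) * width
    y1 = (box.cy + box.h / 2) * height

    def clip(value: float, upper: int) -> int:
        return round(min(max(value, 0.0), float(upper)))

    return clip(x0, width), clip(y0, height), clip(x1, width), clip(y1, height)


# The same normalized layout rendered on landscape and portrait canvases.
layout = [Box("dog", 0.5, 0.5, 0.5, 0.5)]
landscape = [to_pixels(b, 640, 480) for b in layout]
portrait = [to_pixels(b, 480, 640) for b in layout]
```

Keeping coordinates normalized is one simple way to reuse a single layout across canvas shapes; whether the paper's aspect-aware architecture works this way is not stated in the text above.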

Code & Models

Models

🤗
dsrivastavv/Lay-Your-Scene
model

Datasets

dsrivastavv/COCOCaptionGrounded
dataset · 6 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis

Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Diffusion · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding