Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning
Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, Bo Dai

TL;DR
This paper presents DirectLayout, a novel framework that generates 3D indoor scene layouts directly from text descriptions by leveraging large language models and spatial reasoning, improving flexibility and alignment with user instructions.
Contribution
It introduces a three-stage process for text-to-3D layout generation using LLMs, Chain-of-Thought reasoning, and iterative alignment, addressing dataset limitations and enhancing scene plausibility.
Findings
Achieves high semantic consistency in generated layouts
Demonstrates strong generalization across diverse scenes
Ensures physically plausible object placements
Abstract
Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using the generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into…
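To make the idea of a "numerical layout" concrete, the sketch below shows one way a BEV box could be represented and lifted into 3D, along with a simple footprint-overlap check of the kind a plausibility test might use. It is an illustration only, not the authors' implementation: the names `BEVBox`, `Box3D`, `lift_to_3d`, and `bev_overlap` are hypothetical, boxes are assumed to be axis-aligned (rotation is ignored), and all dimensions are assumed to be in metres.

```python
from dataclasses import dataclass


@dataclass
class BEVBox:
    """Hypothetical top-down footprint: centre (x, y) plus width and depth."""
    name: str
    x: float
    y: float
    width: float
    depth: float


@dataclass
class Box3D:
    """Hypothetical 3D box: centre (x, y, z) plus width, depth, and height."""
    name: str
    x: float
    y: float
    z: float
    width: float
    depth: float
    height: float


def lift_to_3d(box: BEVBox, height: float, z_bottom: float = 0.0) -> Box3D:
    """Lift a BEV footprint into 3D by giving it a height and a base elevation."""
    return Box3D(
        name=box.name,
        x=box.x,
        y=box.y,
        z=z_bottom + height / 2.0,  # z stores the vertical centre of the box
        width=box.width,
        depth=box.depth,
        height=height,
    )


def bev_overlap(a: BEVBox, b: BEVBox) -> bool:
    """Return True if two axis-aligned footprints intersect (touching edges do not count)."""
    overlap_x = abs(a.x - b.x) < (a.width + b.width) / 2.0
    overlap_y = abs(a.y - b.y) < (a.depth + b.depth) / 2.0
    return overlap_x and overlap_y


# Illustrative values, not taken from the paper.
bed = BEVBox("bed", x=1.0, y=1.0, width=2.0, depth=1.6)
nightstand = BEVBox("nightstand", x=2.3, y=1.0, width=0.5, depth=0.5)
chair = BEVBox("chair", x=1.5, y=1.0, width=0.5, depth=0.5)

bed_3d = lift_to_3d(bed, height=0.5)
collisions = [
    (a.name, b.name)
    for a, b in [(bed, nightstand), (bed, chair)]
    if bev_overlap(a, b)
]
```

Here the nightstand sits just beside the bed, so only the chair placement is flagged as colliding. A real system would also need to handle rotated boxes and vertical stacking, which this sketch leaves out.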
Taxonomy
Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
