CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling
Yingrui Wu, Youkang Kong, Mingyang Zhao, Weize Quan, Dong-Ming Yan, Yang Liu

TL;DR
CasLayout is a cascaded diffusion framework for indoor scene synthesis that models spatial relations implicitly in a latent space, reducing data requirements and improving controllability and realism.
Contribution
The paper introduces a cascaded diffusion approach with implicit relation modeling and sparse relation graphs, enabling flexible, more accurate indoor scene generation, including zero-shot image-to-scene generation.
Findings
Achieves state-of-the-art fidelity and diversity in scene synthesis.
Enhances relational controllability through sparse graph encoding.
Supports zero-shot image-to-scene generation with LLMs and VLMs.
Abstract
Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language…
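The four-stage decomposition described in the abstract can be pictured as a chain of conditional samplers, each consuming the outputs of the stages before it. The sketch below is only a toy illustration of that data flow, not the authors' implementation: every stage is a random stub standing in for a diffusion model, and all names (`CATALOG`, `OBB`, `stage1_categories`, `generate_scene`, the room size, the latent dimension) are assumptions introduced here. It also shows how an externally supplied category list, such as one proposed by an LLM, might replace the first stage while the remaining stages run unchanged.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical catalog: category -> nominal (width, height, depth) in meters.
CATALOG = {
    "bed": (2.0, 0.5, 1.6),
    "wardrobe": (1.2, 2.0, 0.6),
    "nightstand": (0.5, 0.5, 0.4),
    "desk": (1.2, 0.75, 0.6),
}


@dataclass
class OBB:
    """Oriented bounding box: center, size, and yaw about the vertical axis."""
    category: str
    center: tuple  # (x, y, z)
    size: tuple    # (width, height, depth)
    yaw: float     # radians

    def footprint_corners(self):
        """Return the four floor-plane corners as (x, z) pairs."""
        cx, _, cz = self.center
        hw, hd = self.size[0] / 2, self.size[2] / 2
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        return [(cx + c * dx - s * dz, cz + s * dx + c * dz)
                for dx, dz in ((-hw, -hd), (hw, -hd), (hw, hd), (-hw, hd))]


def stage1_categories(rng):
    """Stage 1 stub: sample how many objects there are and their categories."""
    count = rng.randint(2, 5)
    return [rng.choice(sorted(CATALOG)) for _ in range(count)]


def stage2_sizes(categories, rng):
    """Stage 2 stub: jitter a nominal size per object, conditioned on category."""
    return [tuple(d * rng.uniform(0.9, 1.1) for d in CATALOG.get(c, (1.0, 1.0, 1.0)))
            for c in categories]


def stage3_relations(categories, sizes, rng, dim=4):
    """Stage 3 stub: one latent vector per object; sizes are unused in this toy."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in categories]


def stage4_obbs(categories, sizes, relations, rng, room=4.0):
    """Stage 4 stub: decode latents into boxes inside a square room."""
    boxes = []
    for cat, size, z in zip(categories, sizes, relations):
        x = room * 0.5 * math.tanh(z[0])   # squash latent into room bounds
        zc = room * 0.5 * math.tanh(z[1])
        yaw = rng.choice((0.0, math.pi / 2, math.pi, 3 * math.pi / 2))
        boxes.append(OBB(cat, (x, size[1] / 2, zc), size, yaw))
    return boxes


def generate_scene(seed=0, categories=None):
    """Run the cascade; pass `categories` to bypass stage 1 with external input."""
    rng = random.Random(seed)
    if categories is None:
        categories = stage1_categories(rng)
    sizes = stage2_sizes(categories, rng)
    relations = stage3_relations(categories, sizes, rng)
    return stage4_obbs(categories, sizes, relations, rng)
```

Calling `generate_scene(seed=0)` runs all four stubs, while `generate_scene(seed=0, categories=["bed", "wardrobe"])` skips the first stage. Because each stage only sees the outputs of earlier ones, any single stage could in principle be swapped for a different model or an external source without retraining the rest, which is the kind of decoupling the abstract attributes to the cascaded design.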
