RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing

Wentang Chen; Shougao Zhang; Yiman Zhang; Tianhao Zhou; Ruihui Li

arXiv:2512.11234 · cs.CV · May 20, 2026

TL;DR

RoomPilot is a framework that enables controllable indoor scene generation from multi-modal inputs like text and CAD plans, using a structured semantic language and hierarchical synthesis for realistic, coherent 3D scenes.

Contribution

RoomPilot introduces IDSL (Indoor Domain-Specific Language), a structured semantic representation for indoor scenes, and a hierarchical pipeline for precise, controllable scene synthesis from multi-modal inputs.

Findings

01. Effective multi-modal understanding demonstrated
02. High controllability in scene generation achieved
03. Improved visual realism and physical consistency

Abstract

Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency…
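
To make the building, room, and object hierarchy mentioned in the abstract concrete, here is a minimal Python sketch of a three-level scene description with a simple bounds check. It is not the paper's IDSL: the class names (Building, Room, SceneObject), their fields, and the helper functions object_fits and out_of_bounds are all hypothetical, chosen only to illustrate how a structured, explicit representation makes a scene easy to inspect and validate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical three-level hierarchy (building -> room -> object).
# None of these names or fields are taken from the paper's IDSL.

@dataclass
class SceneObject:
    category: str                  # e.g. "bed"
    position: Tuple[float, float]  # (x, y) centre of the footprint, in metres
    size: Tuple[float, float]      # (width, depth) of the footprint, in metres

@dataclass
class Room:
    name: str
    size: Tuple[float, float]      # (width, depth) of a rectangular room, in metres
    objects: List[SceneObject] = field(default_factory=list)

@dataclass
class Building:
    rooms: List[Room] = field(default_factory=list)

def object_fits(room: Room, obj: SceneObject) -> bool:
    """Return True if the object's axis-aligned footprint lies inside the room."""
    x, y = obj.position
    w, d = obj.size
    return (x - w / 2 >= 0 and y - d / 2 >= 0
            and x + w / 2 <= room.size[0] and y + d / 2 <= room.size[1])

def out_of_bounds(building: Building) -> List[Tuple[str, str]]:
    """List (room name, object category) pairs whose footprint leaves the room."""
    return [(room.name, obj.category)
            for room in building.rooms
            for obj in room.objects
            if not object_fits(room, obj)]

# Example scene: the wardrobe is placed so that it pokes through the right wall.
bedroom = Room("bedroom", (4.0, 3.0), [
    SceneObject("bed", (1.0, 1.5), (2.0, 1.6)),
    SceneObject("wardrobe", (3.9, 0.3), (1.0, 0.6)),
])
building = Building([bedroom])
print(out_of_bounds(building))
```

Because every level is explicit data, a check like this can run before any 3D assets are placed, which is one plausible reading of why a structured representation helps with control; the paper's actual validation steps, if any, are not described in the abstract.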

Taxonomy

Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications