RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing
Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

TL;DR
RoomPilot is a framework that enables controllable indoor scene generation from multi-modal inputs like text and CAD plans, using a structured semantic language and hierarchical synthesis for realistic, coherent 3D scenes.
Contribution
The paper introduces IDSL (Indoor Domain-Specific Language), a structured semantic representation for indoor scenes, and a hierarchical pipeline for precise, controllable scene synthesis from multi-modal inputs.
Findings
Effective multi-modal understanding demonstrated
High controllability in scene generation achieved
Improved visual realism and physical consistency
Abstract
Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency…
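The abstract does not give the IDSL grammar, so the following is only an illustrative sketch of what a building / room / object hierarchy could look like as a data structure. Every name here (`Building`, `Room`, `SceneObject`, `walk`, `out_of_bounds`) and the rectangular-footprint bounds check are assumptions made for illustration, not the authors' representation or method.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    # Hypothetical object record: footprint only, axis-aligned, in metres.
    category: str
    position: Tuple[float, float]  # (x, y) centre in room coordinates
    size: Tuple[float, float]      # (width, depth)


@dataclass
class Room:
    # Hypothetical room record: a rectangle with its origin at one corner.
    name: str
    size: Tuple[float, float]      # (width, depth)
    objects: List[SceneObject] = field(default_factory=list)


@dataclass
class Building:
    name: str
    rooms: List[Room] = field(default_factory=list)


def fits_in_room(obj: SceneObject, room: Room) -> bool:
    """True if the object's footprint lies entirely inside the room rectangle."""
    half_w, half_d = obj.size[0] / 2, obj.size[1] / 2
    x, y = obj.position
    return (half_w <= x <= room.size[0] - half_w
            and half_d <= y <= room.size[1] - half_d)


def walk(building: Building) -> List[Tuple[str, str]]:
    """List (level, label) pairs in building -> room -> object order."""
    visited = [("building", building.name)]
    for room in building.rooms:
        visited.append(("room", room.name))
        for obj in room.objects:
            visited.append(("object", obj.category))
    return visited


def out_of_bounds(building: Building) -> List[Tuple[str, str]]:
    """List (room name, object category) for objects that leave their room."""
    return [(room.name, obj.category)
            for room in building.rooms
            for obj in room.objects
            if not fits_in_room(obj, room)]


# Made-up example scene: the wardrobe is deliberately placed across a wall.
bedroom = Room("bedroom", (4.0, 3.0), [
    SceneObject("bed", (1.0, 1.5), (2.0, 1.6)),
    SceneObject("wardrobe", (3.9, 0.5), (1.0, 0.6)),
])
home = Building("home", [bedroom])
```

A structured representation of this kind makes each level explicit, so a constraint such as "objects stay inside their room" can be checked directly instead of being inferred from generated geometry.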
Taxonomy
Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
