Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation
Stefan Ainetter, Thomas Deixelberger, Edoardo A. Dominici, Philipp Drescher, Konstantinos Vardis, Markus Steinberger

TL;DR
GuidedSceneGen is a novel framework for text-to-3D indoor scene generation that ensures absolute scale accuracy, semantic interpretability, and spatial coherence through a combination of global layout prediction, diffusion models, and 3D fusion.
Contribution
GuidedSceneGen maintains an absolute coordinate frame, integrates semantic and geometric guidance, and employs efficient diffusion models for high-quality, consistent 3D scene synthesis from text.
Findings
Achieves up to 10x faster sampling of unobserved regions by guiding a video diffusion model with optimized camera trajectories.
Produces more spatially coherent and semantically accurate 3D scenes than previous methods.
Enables accurate transfer of object poses and supports scene expansion without re-alignment.
Abstract
We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balance coverage and collision avoidance, achieving up to 10x faster sampling compared…
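The trade-off between coverage and collision avoidance mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's trajectory optimizer: it assumes a 2D occupancy grid derived from a layout, a fixed set of candidate camera cells, a visibility model without occlusion, and a simple greedy selection. All names (`visible_cells`, `collision_penalty`, `greedy_trajectory`) and parameters (`radius`, `clearance`, `weight`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def visible_cells(occupancy, cam, radius):
    """Mask of free cells within `radius` cells of `cam` (assumption: no occlusion test)."""
    rows, cols = np.indices(occupancy.shape)
    dist = np.hypot(rows - cam[0], cols - cam[1])
    return (dist <= radius) & (occupancy == 0)


def collision_penalty(occupancy, cam, clearance):
    """Penalty in [0, 1] that grows as `cam` gets closer than `clearance` to an occupied cell."""
    occupied = np.argwhere(occupancy == 1)
    if len(occupied) == 0:
        return 0.0
    nearest = np.min(np.hypot(occupied[:, 0] - cam[0], occupied[:, 1] - cam[1]))
    return max(0.0, clearance - nearest) / clearance


def greedy_trajectory(occupancy, candidates, steps, radius=3.0, clearance=2.0, weight=10.0):
    """Pick `steps` waypoints, each maximizing newly seen cells minus a weighted collision penalty."""
    seen = np.zeros(occupancy.shape, dtype=bool)
    path = []
    for _ in range(steps):
        best, best_score = None, -np.inf
        for cam in candidates:
            if occupancy[cam] == 1:
                continue  # never place the camera inside an object
            gain = np.count_nonzero(visible_cells(occupancy, cam, radius) & ~seen)
            score = gain - weight * collision_penalty(occupancy, cam, clearance)
            if score > best_score:
                best, best_score = cam, score
        if best is None:
            break  # no free candidate cell available
        path.append(best)
        seen |= visible_cells(occupancy, best, radius)
    return path, seen


# Toy room: an 8x8 grid with a 2x2 piece of furniture in the middle.
grid = np.zeros((8, 8), dtype=int)
grid[3:5, 3:5] = 1
candidates = [(r, c) for r in range(8) for c in range(8)]
path, seen = greedy_trajectory(grid, candidates, steps=4)
```

Raising `weight` pushes waypoints away from furniture at the cost of coverage per step, while lowering it favors coverage. A real system would also need occlusion-aware visibility and smooth camera motion between waypoints, which this sketch leaves out.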
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
