Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation

Stefan Ainetter; Thomas Deixelberger; Edoardo A. Dominici; Philipp Drescher; Konstantinos Vardis; Markus Steinberger

arXiv:2603.13910 · cs.CV · March 17, 2026

Open Access

TL;DR

GuidedSceneGen is a novel framework for text-to-3D indoor scene generation that ensures absolute scale accuracy, semantic interpretability, and spatial coherence through a combination of global layout prediction, diffusion models, and 3D fusion.

Contribution

GuidedSceneGen introduces a method that maintains an absolute coordinate frame throughout generation, integrates semantic and geometric guidance from a predicted global layout, and employs efficient diffusion models for high-quality, consistent 3D scene synthesis from text.
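To make the idea of a layout that carries both semantic and geometric guidance concrete, the following is a minimal sketch, not the paper's implementation. It assumes the layout is a list of axis-aligned boxes in one shared world frame measured in meters, and rasterizes them into a top-down label map and height map. The names `Box` and `rasterize`, the 0.25 m cell size, and the example room are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned object footprint in a shared world frame (meters). Hypothetical type."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    height: float  # top of the object above the floor, in meters


def rasterize(boxes, room_w, room_d, cell=0.25):
    """Return (semantic, height) top-down grids; `cell` is the grid resolution in meters."""
    cols = round(room_w / cell)
    rows = round(room_d / cell)
    semantic = [["floor"] * cols for _ in range(rows)]
    height = [[0.0] * cols for _ in range(rows)]
    for box in boxes:
        for r in range(rows):
            for c in range(cols):
                # Sample each cell at its center, expressed in world meters.
                cx, cy = (c + 0.5) * cell, (r + 0.5) * cell
                inside = box.x_min <= cx <= box.x_max and box.y_min <= cy <= box.y_max
                # Keep the tallest object per cell so the height map stays consistent.
                if inside and box.height > height[r][c]:
                    semantic[r][c] = box.label
                    height[r][c] = box.height
    return semantic, height


# Assumed example: a 4 m x 3 m room with two objects.
layout = [
    Box("bed", 0.0, 0.0, 2.0, 1.5, 0.5),
    Box("wardrobe", 3.0, 0.0, 4.0, 0.5, 2.0),
]
semantic, height = rasterize(layout, room_w=4.0, room_d=3.0)
```

Because every value is stored in meters in one frame, the two grids line up cell for cell, which is the property a semantics- and depth-conditioned generator would rely on.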

Findings

01. Achieves up to 10x faster scene sampling with guided camera trajectories.
02. Produces more spatially coherent and semantically accurate 3D scenes than previous methods.
03. Enables accurate transfer of object poses and supports scene expansion without re-alignment.
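The third finding can be illustrated with a small sketch. It assumes, hypothetically, that each object is stored with a pose in a single absolute world frame (meters, yaw in radians). Under that assumption, swapping a proxy for a detailed asset keeps the pose unchanged, and adding a second room is plain concatenation with no registration step. `Pose`, `Placed`, `transfer_pose`, and `expand` are illustrative names, not the authors' API.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pose:
    """Object pose in the shared world frame (hypothetical representation)."""
    x: float
    y: float
    z: float
    yaw: float  # rotation about the vertical axis, in radians


@dataclass(frozen=True)
class Placed:
    label: str
    pose: Pose
    asset: str = "proxy_box"


def transfer_pose(obj, asset_name):
    """Swap the proxy for a detailed asset while keeping the world pose untouched."""
    return replace(obj, asset=asset_name)


def expand(scene, new_objects):
    """Add objects from a newly generated region; both lists share one frame."""
    return scene + new_objects


def distance(a, b):
    """Metric distance between two placed objects, in meters."""
    pa = (a.pose.x, a.pose.y, a.pose.z)
    pb = (b.pose.x, b.pose.y, b.pose.z)
    return math.dist(pa, pb)


# Assumed example values for two adjacent rooms.
bedroom = [Placed("bed", Pose(1.0, 0.75, 0.0, 0.0))]
living_room = [Placed("sofa", Pose(5.0, 3.75, 0.0, math.pi / 2))]

scene = expand(bedroom, living_room)
detailed_bed = transfer_pose(scene[0], "bed_mesh_01")
```

In this toy setup, `distance(scene[0], scene[1])` is directly meaningful in meters because neither room was generated in its own arbitrary frame.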

Abstract

We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balance coverage and collision avoidance, achieving up to 10x faster sampling compared…
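The abstract does not state how camera trajectories are optimized, so the following is only a toy sketch of one way to trade coverage against collision avoidance. It assumes a 2D occupancy grid derived from the layout, candidate paths given as lists of grid cells, a square visibility window around each waypoint that ignores occlusion, and a fixed collision penalty. All names and constants are hypothetical.

```python
def visible_cells(path, blocked, rows, cols, radius=1):
    """Free cells within `radius` (Chebyshev) of any waypoint; occlusion is ignored."""
    seen = set()
    for r, c in path:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                cell = (r + dr, c + dc)
                in_bounds = 0 <= cell[0] < rows and 0 <= cell[1] < cols
                if in_bounds and cell not in blocked:
                    seen.add(cell)
    return seen


def score(path, blocked, rows, cols, penalty=10.0):
    """Coverage reward minus a penalty for every waypoint inside an occupied cell."""
    collisions = sum(1 for cell in path if cell in blocked)
    coverage = len(visible_cells(path, blocked, rows, cols))
    return coverage - penalty * collisions


def best_path(candidates, blocked, rows, cols):
    """Pick the highest-scoring candidate trajectory."""
    return max(candidates, key=lambda p: score(p, blocked, rows, cols))


# Assumed example: a 5 x 5 room with one occupied cell in the middle.
blocked = {(2, 2)}
straight = [(2, 0), (2, 1), (2, 2), (2, 3), (2, 4)]  # passes through the obstacle
detour = [(2, 0), (1, 1), (1, 2), (1, 3), (2, 4)]    # goes around it
chosen = best_path([straight, detour], blocked, rows=5, cols=5)
```

Here the detour wins because it sees more free cells and incurs no collision penalty. A real system would score continuous camera poses against the 3D layout rather than grid cells.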

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · 3D Shape Modeling and Analysis