SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

TL;DR
SceneSmith is a hierarchical framework that generates diverse, physically realistic indoor scenes from natural language prompts, significantly improving scene complexity and realism for robotic simulation.
Contribution
It introduces a novel agentic, multi-stage approach combining text-to-3D synthesis, dataset retrieval, and physical estimation to create detailed, simulation-ready indoor environments.
Findings
Generates 3-6x more objects than prior methods
Achieves less than 2% inter-object collisions
96% of objects remain stable under physics simulation
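The collision and stability figures above are rates over the objects in a scene. The sketch below shows one plausible way such rates could be computed. It is not the paper's evaluation code: the axis-aligned `Box` type, the pairwise definition of collision rate, and the 1 cm displacement threshold in `stability_rate` are all assumptions made for illustration.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from itertools import combinations

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical stand-in for real collision geometry)."""
    lo: Vec3
    hi: Vec3


def boxes_overlap(a: Box, b: Box, tol: float = 1e-6) -> bool:
    # Two boxes interpenetrate only if they overlap by more than `tol` on every axis,
    # so objects that merely rest on each other are not counted as colliding.
    return all(a.lo[i] < b.hi[i] - tol and b.lo[i] < a.hi[i] - tol for i in range(3))


def collision_rate(boxes: list[Box]) -> float:
    """Fraction of object pairs whose boxes interpenetrate (assumed definition)."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 0.0
    return sum(boxes_overlap(a, b) for a, b in pairs) / len(pairs)


def stability_rate(before: list[Vec3], after: list[Vec3], max_disp: float = 0.01) -> float:
    """Fraction of objects that moved less than `max_disp` metres during simulation."""
    if not before:
        return 1.0
    stable = sum(math.dist(p, q) < max_disp for p, q in zip(before, after))
    return stable / len(before)
```

Here `before` and `after` are object positions sampled before and after running a physics simulator for a settling period; any simulator that reports object poses could supply them.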
Abstract
Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects,…
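The staged designer/critic/orchestrator interaction described in the abstract can be pictured as a propose-and-review loop run once per stage. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Scene` container, the `Designer` and `Critic` callable signatures, `run_stage`, `orchestrate`, and the `max_rounds` limit are all hypothetical, and plain Python callables stand in for the VLM agents.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Stage names taken from the abstract; everything else here is illustrative.
STAGES = ["architectural layout", "furniture placement", "small object population"]


@dataclass
class Scene:
    prompt: str
    items: dict[str, list[str]] = field(default_factory=dict)


# (scene, stage, feedback) -> proposed items for that stage
Designer = Callable[[Scene, str, str], list[str]]
# (scene, stage, proposal) -> (accepted, feedback)
Critic = Callable[[Scene, str, list[str]], tuple[bool, str]]


def run_stage(scene: Scene, stage: str, designer: Designer, critic: Critic,
              max_rounds: int = 3) -> Scene:
    """Let the designer propose and the critic review until accepted or out of rounds."""
    feedback = ""
    proposal: list[str] = []
    for _ in range(max_rounds):
        proposal = designer(scene, stage, feedback)
        accepted, feedback = critic(scene, stage, proposal)
        if accepted:
            break
    # Keep the last proposal even if the critic never accepted it.
    scene.items[stage] = proposal
    return scene


def orchestrate(prompt: str, designer: Designer, critic: Critic) -> Scene:
    """Run the stages in order, so later stages can see earlier results in `scene`."""
    scene = Scene(prompt)
    for stage in STAGES:
        run_stage(scene, stage, designer, critic)
    return scene
```

In a real system the designer and critic would wrap calls to a vision-language model, and the orchestrator would likely make richer decisions than a fixed round limit; the sketch only shows the control flow implied by the stage-by-stage description.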
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
