Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li

TL;DR
Scenethesis is a novel framework that combines language models and vision perception to generate diverse, realistic, and physically plausible 3D scenes from text prompts, overcoming limitations of existing methods.
Contribution
It introduces a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement for 3D scene synthesis.
Findings
Generates diverse and realistic 3D scenes from text.
Ensures physical plausibility and spatial coherence.
Outperforms existing methods in scene realism and diversity.
Abstract
Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting…
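The two stages named in the abstract (an LLM drafts a coarse layout, then a vision module refines it) can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: `ObjectPlacement`, `generate_scene`, `toy_planner`, and `toy_refiner` are hypothetical names, and the toy functions are stand-ins for the real LLM and vision components, whose internals the abstract does not specify.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ObjectPlacement:
    """One object in a layout (hypothetical structure, not from the paper)."""
    name: str
    position: Tuple[float, float, float]  # (x, y, z); units are an assumption


Layout = List[ObjectPlacement]


def generate_scene(
    prompt: str,
    plan: Callable[[str], Layout],
    refine: Callable[[str, Layout], Layout],
) -> Layout:
    """Run the two stages in order: coarse planning, then refinement."""
    coarse_layout = plan(prompt)          # stage 1: stands in for the LLM planner
    return refine(prompt, coarse_layout)  # stage 2: stands in for the vision module


def toy_planner(prompt: str) -> Layout:
    """Placeholder planner: returns a fixed draft with every object at the origin."""
    return [
        ObjectPlacement("table", (0.0, 0.0, 0.0)),
        ObjectPlacement("lamp", (0.0, 0.0, 0.0)),
    ]


def toy_refiner(prompt: str, layout: Layout) -> Layout:
    """Placeholder refiner: spreads objects along x so no two share a position."""
    return [
        replace(obj, position=(float(i), obj.position[1], obj.position[2]))
        for i, obj in enumerate(layout)
    ]


scene = generate_scene("a small study", toy_planner, toy_refiner)
```

Passing the planner and refiner as callables keeps the orchestration independent of any particular model, which fits the training-free, agentic description: either stage could be swapped without touching `generate_scene`.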
Taxonomy
Topics: 3D Modeling in Geospatial Applications · 3D Surveying and Cultural Heritage · Human Motion and Animation
