Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Lu Ling; Chen-Hsuan Lin; Tsung-Yi Lin; Yifan Ding; Yu Zeng; Yichen Sheng; Yunhao Ge; Ming-Yu Liu; Aniket Bera; Zhaoshuo Li

arXiv:2505.02836 · cs.CV · May 6, 2025

TL;DR

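Scenethesis is a training-free framework that combines LLM-based scene planning with vision-guided layout refinement to generate diverse, realistic, and physically plausible 3D scenes from text prompts. It targets two weaknesses of existing methods: the limited diversity of learning-based approaches trained on small indoor datasets, and the unnatural object placements produced by LLM-only planning.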

Contribution

It introduces a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement for 3D scene synthesis.

Findings

01. Generates diverse and realistic 3D scenes from text.

02. Ensures physical plausibility and spatial coherence.

03. Outperforms existing methods in scene realism and diversity.

Abstract

Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting…
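The two stages named in the abstract (a coarse layout draft, then a refinement step driven by guidance) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `Placement`, `draft_coarse_layout`, `refine_with_guidance`, and the `weight` parameter are hypothetical names, the planner is a stub rather than an LLM call, and the refinement is a simple linear blend toward guidance positions, not the method used in the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Placement:
    """Hypothetical 2D object placement (not a structure from the paper)."""
    name: str
    x: float
    y: float


def draft_coarse_layout(object_names):
    # Stand-in for the LLM planning stage: put objects on a line, one unit apart.
    return [Placement(name, float(i), 0.0) for i, name in enumerate(object_names)]


def refine_with_guidance(layout, guidance, weight=0.5):
    # Stand-in for vision-guided refinement: move each object part of the way
    # toward its guidance position. Objects without guidance are left unchanged.
    refined = []
    for p in layout:
        if p.name in guidance:
            gx, gy = guidance[p.name]
            p = replace(p, x=p.x + weight * (gx - p.x), y=p.y + weight * (gy - p.y))
        refined.append(p)
    return refined


coarse = draft_coarse_layout(["sofa", "table"])
guidance = {"table": (3.0, 2.0)}  # assumed positions, e.g. read off a guidance image
refined = refine_with_guidance(coarse, guidance)
```

In this toy setup the sofa stays where the planner put it, while the table moves halfway from its drafted position toward the guidance position.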

Taxonomy

Topics: 3D Modeling in Geospatial Applications · 3D Surveying and Cultural Heritage · Human Motion and Animation