SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng

TL;DR
SceneAssistant introduces a visual-feedback-driven agent that leverages vision-language models to generate diverse, high-quality 3D scenes from natural language, enabling open-vocabulary scene synthesis and editing with iterative refinement.
Contribution
It presents a novel framework combining 3D object generation and vision-language models for open-vocabulary scene creation and editing, overcoming domain restrictions of prior methods.
Findings
Outperforms existing methods in diversity and quality of 3D scene generation
Enables natural language scene editing and refinement
Demonstrates effectiveness through qualitative and quantitative evaluations
Abstract
Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation models along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment…
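The observe-then-act loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Only the operation names Scale, Rotate, and FocusOn come from the abstract. Everything else is an assumption made for illustration: the `Scene` and `SceneObject` classes, the `(operation, target, value)` action format, the text-based `render` stand-in, and `toy_policy`, a hand-written rule that takes the place of the VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SceneObject:
    """Hypothetical object record; the paper's real representation is not given."""
    name: str
    scale: float = 1.0
    yaw_deg: float = 0.0


@dataclass
class Scene:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    focus: Optional[str] = None  # name of the object the camera looks at


# Assumed action format: (operation name, target object name, numeric argument).
Action = Tuple[str, str, float]
Policy = Callable[[str, Scene], Optional[Action]]


def apply_action(scene: Scene, action: Action) -> None:
    """Apply one atomic operation to the scene in place."""
    op, target, value = action
    obj = scene.objects[target]
    if op == "Scale":
        obj.scale *= value
    elif op == "Rotate":
        obj.yaw_deg = (obj.yaw_deg + value) % 360.0
    elif op == "FocusOn":
        scene.focus = target  # the numeric argument is ignored here
    else:
        raise ValueError(f"unknown operation: {op}")


def render(scene: Scene) -> str:
    """Stand-in for rendering: returns a text summary instead of an image."""
    parts = [
        f"{o.name}(scale={o.scale:.2f}, yaw={o.yaw_deg:.0f})"
        for o in scene.objects.values()
    ]
    return f"focus={scene.focus}; " + ", ".join(parts)


def run_agent(scene: Scene, policy: Policy, max_steps: int = 10) -> List[Action]:
    """Alternate between observing the scene and applying the chosen action."""
    history: List[Action] = []
    for _ in range(max_steps):
        observation = render(scene)
        action = policy(observation, scene)
        if action is None:  # the policy signals that the scene is acceptable
            break
        apply_action(scene, action)
        history.append(action)
    return history


def toy_policy(observation: str, scene: Scene) -> Optional[Action]:
    """Hypothetical stand-in for the VLM: halve any object larger than unit scale."""
    for obj in scene.objects.values():
        if obj.scale > 1.0:
            return ("Scale", obj.name, 0.5)
    return None


if __name__ == "__main__":
    scene = Scene(objects={"chair": SceneObject("chair", scale=4.0)})
    print(run_agent(scene, toy_policy))
    print(render(scene))
```

In a real system the policy would call a VLM on a rendered image and parse its reply into an action. The sketch keeps that step as a plain function so the loop structure stays visible.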
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
