SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Jun Luo; Jiaxiang Tang; Ruijie Lu; Gang Zeng

arXiv:2603.12238·cs.CV·March 13, 2026

TL;DR

SceneAssistant introduces a visual-feedback-driven agent that leverages vision-language models to generate diverse, high-quality 3D scenes from natural language, enabling open-vocabulary scene synthesis and editing with iterative refinement.

Contribution

The paper presents a framework that combines 3D object generation with vision-language models for open-vocabulary scene creation and editing, overcoming the domain restrictions of prior methods.

Findings

01. Outperforms existing methods in diversity and quality of 3D scene generation
02. Enables natural language scene editing and refinement
03. Demonstrates effectiveness through qualitative and quantitative evaluations

Abstract

Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment…
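The feedback loop described in the abstract (render, let the model pick an atomic operation, apply it, repeat) can be illustrated with a small sketch. The abstract does not specify any interfaces, so everything below is an assumption made for illustration: the `Scene`, `SceneObject`, and `Action` types, the `apply_action`, `render`, and `refine` functions, the `Done` stop signal, and the parameters of each operation are all hypothetical. Only the operation names Scale, Rotate, and FocusOn come from the abstract. A scripted policy stands in for the VLM, and `render` returns a text summary rather than an image.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SceneObject:
    name: str
    scale: float = 1.0
    yaw_deg: float = 0.0  # rotation about the vertical axis (assumed convention)


@dataclass
class Scene:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    focus: Optional[str] = None  # object the camera is centred on


@dataclass
class Action:
    op: str  # "Scale", "Rotate", "FocusOn", or the hypothetical stop signal "Done"
    target: str = ""
    value: float = 0.0


def apply_action(scene: Scene, action: Action) -> None:
    """Apply one atomic operation to the scene in place."""
    if action.op == "Scale":
        scene.objects[action.target].scale *= action.value
    elif action.op == "Rotate":
        obj = scene.objects[action.target]
        obj.yaw_deg = (obj.yaw_deg + action.value) % 360.0
    elif action.op == "FocusOn":
        scene.focus = action.target
    else:
        raise ValueError(f"unknown operation: {action.op}")


def render(scene: Scene) -> str:
    """Stand-in for a renderer: returns a text summary instead of an image."""
    parts = [
        f"{o.name}(scale={o.scale:.2f}, yaw={o.yaw_deg:.0f})"
        for o in scene.objects.values()
    ]
    return f"focus={scene.focus}; " + ", ".join(parts)


Policy = Callable[[str, List[Action]], Action]


def refine(scene: Scene, policy: Policy, max_steps: int = 10) -> List[Action]:
    """Render, ask the policy for one action, apply it, and repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        feedback = render(scene)
        action = policy(feedback, history)
        if action.op == "Done":
            break
        apply_action(scene, action)
        history.append(action)
    return history


def scripted_policy(feedback: str, history: List[Action]) -> Action:
    """Placeholder for a VLM call: replays a fixed script, then stops."""
    script = [
        Action("FocusOn", "chair"),
        Action("Scale", "chair", 0.5),
        Action("Rotate", "chair", 90.0),
    ]
    return script[len(history)] if len(history) < len(script) else Action("Done")


scene = Scene({"chair": SceneObject("chair"), "table": SceneObject("table")})
history = refine(scene, scripted_policy)
print(render(scene))
```

In a real system the policy would send the rendered image and the action history to a VLM and parse its reply into an action; keeping the policy as a plain callable makes that substitution possible without touching the loop itself.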

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis