Agentic 3D Scene Generation with Spatially Contextualized VLMs

Xinhang Liu; Yu-Wing Tai; Chi-Keung Tang

arXiv:2505.20129·cs.CV·July 8, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a novel framework that enhances vision-language models with a structured, spatially contextual understanding of 3D scenes, enabling more effective generation, editing, and reasoning in complex environments.

Contribution

It presents a new paradigm for VLMs to generate and understand 3D scenes by integrating a dynamic spatial context comprising a scene portrait, labeled point cloud, and scene hypergraph.

Findings

01 Framework handles diverse, challenging inputs effectively.

02 Enables downstream tasks such as scene editing and path planning.

03 Demonstrates improved generalization over prior methods.

Abstract

Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured,…
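The three components named in the abstract can be pictured as a small data structure. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation: the class names (`LabeledPointCloud`, `Hyperedge`, `SpatialContext`), the relation names (`on_floor`, `left_of`, `aligned_y`), and the centroid-based checks in `holds` are all hypothetical and chosen only to illustrate unary, binary, and higher-order constraints.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class LabeledPointCloud:
    """Points with one semantic label each (hypothetical layout)."""
    points: List[Point]
    labels: List[str]

    def centroid(self, label: str) -> Point:
        """Mean position of all points carrying the given label."""
        pts = [p for p, l in zip(self.points, self.labels) if l == label]
        if not pts:
            raise KeyError(f"no points labeled {label!r}")
        n = len(pts)
        return (sum(p[0] for p in pts) / n,
                sum(p[1] for p in pts) / n,
                sum(p[2] for p in pts) / n)


@dataclass(frozen=True)
class Hyperedge:
    """A relation over one or more objects: arity 1 is unary, 2 binary, 3+ higher-order."""
    relation: str
    objects: Tuple[str, ...]

    @property
    def arity(self) -> int:
        return len(self.objects)


@dataclass
class SpatialContext:
    """Scene portrait (text), labeled point cloud, and scene hypergraph."""
    portrait: str
    cloud: LabeledPointCloud
    hypergraph: List[Hyperedge] = field(default_factory=list)

    def edges_by_arity(self) -> Dict[int, List[Hyperedge]]:
        groups: Dict[int, List[Hyperedge]] = {}
        for edge in self.hypergraph:
            groups.setdefault(edge.arity, []).append(edge)
        return groups


def holds(ctx: SpatialContext, edge: Hyperedge, tol: float = 0.1) -> bool:
    """Toy centroid-based check of one constraint (assumed predicates only)."""
    c = [ctx.cloud.centroid(name) for name in edge.objects]
    if edge.relation == "on_floor":    # unary: centroid height within tol of z = 0
        return abs(c[0][2]) <= tol
    if edge.relation == "left_of":     # binary: first object has the smaller x
        return c[0][0] < c[1][0]
    if edge.relation == "aligned_y":   # higher-order: all centroids share y within tol
        ys = [p[1] for p in c]
        return max(ys) - min(ys) <= tol
    raise ValueError(f"unknown relation {edge.relation!r}")
```

In this sketch, an agent loop could re-run `holds` over every hyperedge after each edit to the point cloud and flag the constraints that no longer pass. How the paper's system actually verifies or repairs constraints is not specified in the abstract.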

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper’s notion of “spatially contextualized VLMs” is original and well-motivated. It reinterprets VLMs as reasoning agents operating over structured 3D contexts.
2. The ability to handle both single images and unstructured image collections is impressive and practically relevant.

Weaknesses

1. Reported metrics (CLIP, BLIP, LPIPS) evaluate rendered 2D projections. There is no quantitative evaluation of 3D accuracy, geometry consistency, or spatial relation correctness, all critical aspects for a 3D generation paper.
2. While conceptually coherent, the pipeline involves multiple submodules (Fast3R, Point-M2AE, Meshy, Blender), which could make it hard to scale or analyze systematically.
3. While the conceptual framing of “spatially contextualized VLMs” is interesting, many subcomponents:

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper proposes using a structured Spatial Context to provide detailed descriptions of the environment to be generated. This structured data serves as input to the VLM, guiding the 3D scene generation process.
2. During the 3D scene generation, the VLM continuously reads from the Spatial Context, ensuring that the scene remains semantically coherent and geometrically accurate.

Weaknesses

1. In the initialization stage, the pipeline employs Fast3R to construct an initial 3D scene from an image. However, the generated scene contains occlusion relationships between instance assets and the background. Although the paper mentions repairing and generating the invisible regions of the assets, no corresponding repair is conducted for the background. How is the reconstruction completeness of the occluded background areas ensured? According to the supplementary videos, this method can pro

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The core idea of augmenting VLMs with a structured and persistent spatial context is conceptually compelling. This context comprises two key components: the scene portrait and the scene hypergraph, which jointly capture both the semantic content and spatial relationships within a scene in a coherent and interpretable manner. The integration of partial point clouds as geometric priors for individual assets provides a principled mechanism for grounding abstract multimodal inputs into 3D space, effec

Weaknesses

Problems on Writing:
1. The authors state in Section 4.2: *Due to space limitations, we provide only the ablation study on environment setup in the main paper; additional ablations are included in the supplementary material.* However, the follow-up content includes the ablation study on environment setup and layout planning.
2. Table 1 reports AQ (aesthetic quality) and FP (functional plausibility) under columns labeled “(4o/User)(↑)”, suggesting scores from both GPT-4o and human users. Howeve

Taxonomy

Topics: Advanced Vision and Imaging · Remote Sensing and LiDAR Applications · Computer Graphics and Visualization Techniques