Agentic 3D Scene Generation with Spatially Contextualized VLMs
Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

TL;DR
This paper introduces a novel framework that enhances vision-language models with a structured, spatially contextual understanding of 3D scenes, enabling more effective generation, editing, and reasoning in complex environments.
Contribution
It presents a new paradigm for VLMs to generate and understand 3D scenes by integrating a dynamic spatial context comprising a scene portrait, labeled point cloud, and scene hypergraph.
Findings
Framework handles diverse, challenging inputs effectively.
Enables downstream tasks like scene editing and path planning.
Demonstrates improved generalization over prior methods.
Abstract
Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured,…
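The three-part spatial context described in the abstract can be pictured as a small data structure. The sketch below is illustrative only and is not the authors' implementation: the class names (`SpatialContext`, `HyperEdge`), the method names, and the relation labels such as `against_wall` are all hypothetical. It shows a scene portrait stored as text, a labeled point cloud stored as `(x, y, z, label)` tuples, and a hypergraph whose edges may connect one object (unary), two objects (binary), or more (higher-order).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class HyperEdge:
    """One spatial constraint over one or more objects (hypothetical schema)."""
    relation: str              # illustrative label, e.g. "left_of"
    members: tuple[str, ...]   # ids of the objects the constraint involves

    @property
    def arity(self) -> int:
        return len(self.members)


@dataclass
class SpatialContext:
    """Toy container for the three components named in the abstract."""
    portrait: str                                                      # high-level semantic description
    points: list[tuple[float, float, float, str]] = field(default_factory=list)
    edges: list[HyperEdge] = field(default_factory=list)

    def add_point(self, x: float, y: float, z: float, label: str) -> None:
        # Each point carries the semantic label of the object it belongs to.
        self.points.append((x, y, z, label))

    def add_relation(self, relation: str, *members: str) -> HyperEdge:
        if not members:
            raise ValueError("a constraint needs at least one object")
        edge = HyperEdge(relation, tuple(members))
        self.edges.append(edge)
        return edge

    def relations_of(self, obj_id: str) -> list[HyperEdge]:
        # All constraints that mention the given object, in insertion order.
        return [e for e in self.edges if obj_id in e.members]

    def count_by_arity(self) -> dict[str, int]:
        counts = {"unary": 0, "binary": 0, "higher_order": 0}
        for e in self.edges:
            if e.arity == 1:
                counts["unary"] += 1
            elif e.arity == 2:
                counts["binary"] += 1
            else:
                counts["higher_order"] += 1
        return counts


ctx = SpatialContext(portrait="A reading nook with a sofa, a lamp, and two chairs.")
ctx.add_point(0.0, 0.0, 0.4, "sofa")
ctx.add_relation("against_wall", "sofa")                              # unary
ctx.add_relation("left_of", "lamp", "sofa")                           # binary
ctx.add_relation("evenly_spaced", "chair_1", "chair_2", "table")      # higher-order
counts = ctx.count_by_arity()
```

Storing every constraint as an edge over a tuple of object ids lets unary, binary, and higher-order relations share one representation, which is the property that distinguishes a hypergraph from an ordinary scene graph.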
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper’s notion of “spatially contextualized VLMs” is original and well-motivated. It reinterprets VLMs as reasoning agents operating over structured 3D contexts. 2. The ability to handle both single images and unstructured image collections is impressive and practically relevant.
1. Reported metrics (CLIP, BLIP, LPIPS) evaluate rendered 2D projections. There is no quantitative evaluation of 3D accuracy, geometry consistency, or spatial relation correctness, all of which are critical aspects for a 3D generation paper. 2. While conceptually coherent, the pipeline involves multiple submodules (Fast3R, Point-M2AE, Meshy, Blender), which could make it hard to scale or analyze systematically. 3. While the conceptual framing of “spatially contextualized VLMs” is interesting, many subcomponents:
1. The paper proposes using a structured Spatial Context to provide detailed descriptions of the environment to be generated. This structured data serves as input to the VLM, guiding the 3D scene generation process. 2. During the 3D scene generation, the VLM continuously reads from the Spatial Context, ensuring that the scene remains semantically coherent and geometrically accurate.
1. In the initialization stage, the pipeline employs Fast3R to construct an initial 3D scene from an image. However, the generated scene contains occlusion relationships between instance assets and the background. Although the paper mentions repairing and generating the invisible regions of the assets, no corresponding repair is conducted for the background. How is the reconstruction completeness of the occluded background areas ensured? According to the supplementary videos, this method can pro
The core idea of augmenting VLMs with a structured and persistent spatial context is conceptually compelling. This context comprises two key components: the scene portrait and the scene hypergraph, which jointly capture both the semantic content and spatial relationships within a scene in a coherent and interpretable manner. The integration of partial point clouds as geometric priors for individual assets provides a principled mechanism for grounding abstract multimodal inputs into 3D space, effec
## Problems on Writing: 1. The authors state in Section 4.2: *Due to space limitations, we provide only the ablation study on environment setup in the main paper; additional ablations are included in the supplementary material.* However, the follow-up content includes the ablation study on environment setup and layout planning. 2. Table 1 reports AQ (aesthetic quality) and FP (functional plausibility) under columns labeled “(4o/User)(↑)”, suggesting scores from both GPT-4o and human users. Howeve
Taxonomy
Topics: Advanced Vision and Imaging · Remote Sensing and LiDAR Applications · Computer Graphics and Visualization Techniques
