GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

Zhangyang Qi; Zhixiong Zhang; Ye Fang; Jiaqi Wang; Hengshuang Zhao

arXiv:2501.01428·cs.CV·March 12, 2025

TL;DR

GPT4Scene introduces a vision-based paradigm that enhances 3D scene understanding in vision-language models by building global-local scene relationships, significantly improving performance on indoor scene comprehension tasks.

Contribution

The paper proposes GPT4Scene, a novel visual prompting method that improves 3D spatial understanding in VLMs by constructing BEV images and marking object IDs, enabling better global-local scene comprehension.
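As a rough illustration of what marking object IDs on a BEV image involves, the sketch below maps each object's 3D points to a pixel position in a top-down image where its ID could be drawn. It is not the authors' implementation: the function name `bev_marker_positions`, the centroid-based placement, the `image_size` and `margin` parameters, and the z-up axis convention are all assumptions made for this example.

```python
def bev_marker_positions(object_points, image_size=256, margin=0.05):
    """Map per-object 3D points to marker pixels in a top-down (BEV) image.

    object_points: dict {object_id: [(x, y, z), ...]} in scene coordinates,
                   with z assumed to point up (hypothetical convention).
    Returns: dict {object_id: (col, row)} giving where each ID could be drawn.
    """
    all_pts = [p for pts in object_points.values() for p in pts]
    min_x = min(p[0] for p in all_pts)
    max_x = max(p[0] for p in all_pts)
    min_y = min(p[1] for p in all_pts)
    max_y = max(p[1] for p in all_pts)

    # One uniform scale for both axes so the floor plan is not distorted.
    extent = max(max_x - min_x, max_y - min_y) or 1.0
    usable = (image_size - 1) * (1 - 2 * margin)
    scale = usable / extent
    offset = (image_size - 1) * margin

    markers = {}
    for obj_id, pts in object_points.items():
        # Place the marker at the object's centroid, ignoring height (z).
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        col = int(round(offset + (cx - min_x) * scale))
        # Image rows grow downward, so flip the y axis.
        row = int(round(offset + (max_y - cy) * scale))
        markers[obj_id] = (col, row)
    return markers
```

Called with a dictionary of instance point sets, for example from a 3D instance segmentation step, this returns one pixel per object ID. Drawing the IDs onto a rendered top-down image is left out of the sketch.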

Findings

01. GPT4Scene improves zero-shot 3D understanding performance.
02. Fine-tuning with GPT4Scene achieves state-of-the-art results.
03. Models develop intrinsic 3D understanding without explicit prompts.

Abstract

In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. However, their performance in 3D spatial comprehension, which is critical for embodied intelligence, remains limited. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. However, we propose exploring a purely vision-based solution inspired by human perception, which merely relies on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial knowledge, revealing that their primary shortcoming lies in the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm in VLM training and inference that helps build the global-local relationship, significantly improving the 3D…
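One way to picture the global-local correspondence described in the abstract is to give an object the same ID in every frame where it is visible. The sketch below projects an object's 3D center into a single frame with a standard pinhole camera model. It is an illustrative assumption, not the paper's code: the function name `project_marker`, the 3x4 `world_to_cam` extrinsic layout, and the use of the object center as the marker location are all hypothetical.

```python
def project_marker(center_world, world_to_cam, fx, fy, cx, cy, width, height):
    """Project a 3D object center into one video frame (pinhole model).

    center_world: (x, y, z) of the object center in world coordinates.
    world_to_cam: 3 rows of 4 numbers, the [R | t] extrinsic for this frame.
    fx, fy, cx, cy: camera intrinsics in pixels.
    Returns (u, v) pixel coordinates, or None if the point is behind the
    camera or falls outside the image.
    """
    x, y, z = center_world
    # Transform the point from world coordinates into the camera frame.
    xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in world_to_cam)
    if zc <= 0:
        return None  # behind the camera, so no marker in this frame

    # Perspective division followed by the intrinsic mapping to pixels.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    if not (0 <= u < width and 0 <= v < height):
        return None  # projects outside the frame
    return (u, v)
```

Running this for each object over each sampled frame gives the set of frames in which a given ID would appear. A real pipeline would also need an occlusion check, which this sketch omits.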

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. I think this paper is well motivated and has a clear conceptual advance over prior VLMs for 3D. Unlike previous methods that rely heavily on point clouds, the paper proposes a vision-only approach for 3D spatial reasoning, closely mimicking human perceptual processes for scene understanding.
2. The empirical results are solid. Across diverse 3D benchmarks (Tables 3–7, and full results in Tables 12–16), GPT4Scene models consistently set new state-of-the-art or strongly outperform both point-ba

Weaknesses

1. Since the pipeline is dependent on 3D reconstruction and instance segmentation, I have a concern about its robustness. The method assumes reliable 3D scene reconstruction and high-quality mask annotations for generating BEV images and STO-markers. While Table 7 ablation demonstrates some robustness, the reliance on Mask3D and BundleFusion or similar systems (Figure 2) acts as a performance bottleneck. Real-world deployment may face significant degradation under varied lighting, occlusions, or

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper tackles an interesting and important problem of enabling 3D understanding using pre-trained 2D VLMs
- The ablations are thorough and support the main design choices of the paper

Weaknesses

The paper appears to have several wrong claims or statements, and formatting errors.
- L199-200: “our method… matching or surpassing the Chat-Scene Chat-scene” in Table-1. However, from Table-1, it seems that the version with GPT4Scene never surpasses ChatScene.
- From the introduction, it appears that a new dataset is being created and released, i.e. ScanAlign. However, it is essentially a combination of existing (and popular) 3D grounding and captioning datasets with the addition of STO mark

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Extensive quantitative experiments are provided.
2. A new dataset, ScanAlign, is introduced, containing video frames, BEV images with STO markers, and text annotations.
3. The use of spatio-temporal object markers is a well-motivated and effective idea that has not been explored before and yields improvements, particularly in 3D grounding tasks.

Weaknesses

1. For 3D instance segmentation, the authors rely on Mask3D, which is pre-trained on ScanNet scenes. How well do the predicted masks generalize to other 3D environments (e.g., ARKitScenes)?
2. While the performance improvements are appreciated, BEV images have been explored in prior work. Thus, the main methodological contribution lies in the introduction of STO markers, which somewhat limits the overall novelty of the approach.

Code & Models

Repositories

Qi-Zhangyang/GPT4Scene
pytorch · Official

Models

alexzyqi/GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512
model · 29 dl

Datasets

Taxonomy

Topics · Multimodal Machine Learning Applications