ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models

Pranav Saxena; Jimmy Chiun

arXiv:2510.21069 · cs.CV · October 27, 2025


TL;DR

ZING-3D introduces a zero-shot, incremental 3D scene graph generation framework leveraging pretrained vision-language models, enabling semantic understanding and geometric grounding in complex environments without task-specific training.

Contribution

ZING-3D is the first framework to combine zero-shot recognition, incremental updates, and 3D geometric grounding in scene graphs built with pretrained models, targeting robotics applications.

Findings

01. Effective in capturing spatial and relational knowledge

02. Operates without task-specific training

03. Supports open-vocabulary object recognition

Abstract

Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner, while also enabling incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach leverages…
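The truncated abstract does not describe the underlying data structure, so the following is only an illustrative sketch of what an incrementally updated, geometrically grounded scene graph could look like. It is not the authors' implementation. The names `Node`, `SceneGraph`, `merge_radius`, `add_observation`, and `relate` are hypothetical, and the labels and relations that a VLM would supply are passed in by hand here.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Node:
    label: str        # open-vocabulary label (assumed to come from a VLM)
    centroid: tuple   # (x, y, z) in a shared world frame
    count: int = 1    # number of observations merged into this node


@dataclass
class SceneGraph:
    merge_radius: float = 0.5                   # hypothetical threshold, in metres
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)   # (i, j) -> (relation, distance)

    def add_observation(self, label, centroid):
        """Merge into a nearby node with the same label, else add a new node."""
        for idx, node in enumerate(self.nodes):
            close = math.dist(node.centroid, centroid) <= self.merge_radius
            if node.label == label and close:
                n = node.count
                # Running mean of the centroid over all merged observations.
                node.centroid = tuple(
                    (c * n + p) / (n + 1) for c, p in zip(node.centroid, centroid)
                )
                node.count = n + 1
                self._refresh_distances(idx)
                return idx
        self.nodes.append(Node(label, tuple(centroid)))
        return len(self.nodes) - 1

    def relate(self, i, j, relation):
        """Store a semantic relation together with the Euclidean distance."""
        dist = math.dist(self.nodes[i].centroid, self.nodes[j].centroid)
        self.edges[(i, j)] = (relation, dist)

    def _refresh_distances(self, idx):
        """Recompute distances on edges touching a node whose centroid moved."""
        for (i, j), (rel, _) in list(self.edges.items()):
            if idx in (i, j):
                dist = math.dist(self.nodes[i].centroid, self.nodes[j].centroid)
                self.edges[(i, j)] = (rel, dist)


# Usage: two views of the same table collapse into one node.
graph = SceneGraph(merge_radius=0.5)
table = graph.add_observation("table", (1.0, 0.0, 0.0))
mug = graph.add_observation("mug", (1.0, 0.0, 1.0))
graph.relate(mug, table, "on")
same_table = graph.add_observation("table", (1.2, 0.0, 0.0))
```

In this sketch, a repeated observation updates the existing node instead of duplicating it, and the distances stored on its edges are recomputed so the graph stays consistent as new views arrive. Matching on an exact label and a fixed radius is a simplification chosen for brevity.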
