ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models
Pranav Saxena, Jimmy Chiun

TL;DR
ZING-3D is a zero-shot, incremental 3D scene graph generation framework built on pretrained vision-language models (VLMs). It provides semantic understanding and geometric grounding in complex environments without task-specific training.
Contribution
ZING-3D is presented as the first framework to combine zero-shot recognition, incremental updates, and 3D geometric grounding in scene graphs using pretrained models for robotics applications.
Findings
Effective in capturing spatial and relational knowledge
Operates without task-specific training
Supports open-vocabulary object recognition
Abstract
Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner, while also enabling incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach leverages…
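To make the idea of an incrementally updated, geometrically grounded scene graph more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class names, the `merge_radius` threshold, and the rule of merging detections that share a label and lie close together are all assumptions chosen for illustration. The labels stand in for whatever open-vocabulary output a VLM would provide, and positions are assumed to be already expressed in a shared world frame.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Node:
    label: str        # open-vocabulary label (assumed to come from a VLM)
    centroid: tuple   # (x, y, z) in a shared world frame
    count: int = 1    # number of observations merged into this node


@dataclass
class SceneGraph:
    merge_radius: float = 0.5  # hypothetical merge threshold, in metres
    nodes: list = field(default_factory=list)

    def add_observation(self, label, position):
        """Merge a detection into a nearby same-label node, or add a new node."""
        for node in self.nodes:
            close = math.dist(node.centroid, position) <= self.merge_radius
            if node.label == label and close:
                n = node.count
                # A running mean keeps the centroid consistent across views.
                node.centroid = tuple(
                    (c * n + p) / (n + 1) for c, p in zip(node.centroid, position)
                )
                node.count = n + 1
                return node
        node = Node(label, tuple(position))
        self.nodes.append(node)
        return node

    def edges(self):
        """Return (label_a, label_b, distance) for every pair of nodes."""
        out = []
        for i, a in enumerate(self.nodes):
            for b in self.nodes[i + 1:]:
                out.append((a.label, b.label, math.dist(a.centroid, b.centroid)))
        return out


graph = SceneGraph()
graph.add_observation("chair", (1.0, 0.0, 0.0))
graph.add_observation("table", (2.0, 0.0, 0.0))
graph.add_observation("chair", (1.2, 0.0, 0.0))  # second view of the same chair
```

In this sketch the second chair detection is folded into the existing chair node rather than creating a duplicate, which is the behaviour an incremental update needs as new observations arrive. The pairwise distances returned by `edges()` are only a stand-in for the richer semantic and spatial relations the abstract describes.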
