Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

Sergey Linok; Tatiana Zemskova; Svetlana Ladanova; Roman Titkov; Dmitry Yudin; Maxim Monastyrny; Aleksei Valenkov

arXiv:2406.07113·cs.CV·May 7, 2025·2 cites

Open Access

TL;DR

This paper introduces BBQ, a modular approach for 3D object grounding in complex natural language queries, utilizing scene graphs, spatial relations, and large language models to improve accuracy and speed in robotics applications.

Contribution

We propose BBQ, a novel modular framework that integrates scene graph reasoning, spatial relations, and large language models for open-vocabulary 3D object grounding.

Findings

01. BBQ outperforms existing zero-shot methods on the Replica and ScanNet datasets.

02. Leveraging spatial relations improves grounding accuracy in scenes with similar objects.

03. Our approach achieves real-time performance suitable for robotic applications.

Abstract

Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic spatial edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe the objects as graph nodes. On the Replica and ScanNet datasets, we have demonstrated that BBQ takes a leading place in open-vocabulary 3D…
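To make the role of metric spatial edges concrete, the following is a minimal sketch of how a relation can disambiguate between two objects with the same label. It is not the BBQ implementation: the `ObjectNode` class, its fields, and the functions `build_metric_edges` and `ground_nearest` are hypothetical names introduced here, and a simple nearest-anchor rule stands in for the paper's LLM-based deductive scene reasoning.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass(frozen=True)
class ObjectNode:
    """Hypothetical graph node: one object in the 3D map."""
    node_id: int
    label: str      # text description of the object
    center: tuple   # (x, y, z) centroid in metres


def build_metric_edges(nodes):
    """Return {(id_a, id_b): distance} for every pair, stored in both directions."""
    edges = {}
    for a, b in combinations(nodes, 2):
        d = math.dist(a.center, b.center)  # Euclidean distance between centroids
        edges[(a.node_id, b.node_id)] = d
        edges[(b.node_id, a.node_id)] = d
    return edges


def ground_nearest(nodes, edges, target_label, anchor_label):
    """Pick the target-labelled node closest to any anchor-labelled node.

    Assumes target_label != anchor_label. This rule is an illustrative
    stand-in for LLM reasoning over the graph, not the paper's algorithm.
    """
    targets = [n for n in nodes if n.label == target_label]
    anchors = [n for n in nodes if n.label == anchor_label]
    if not targets or not anchors:
        return None
    return min(
        targets,
        key=lambda t: min(edges[(t.node_id, a.node_id)] for a in anchors),
    )


# Toy scene: two chairs, only one of which is near the window.
scene = [
    ObjectNode(0, "chair", (0.0, 0.0, 0.0)),
    ObjectNode(1, "chair", (4.0, 0.0, 0.0)),
    ObjectNode(2, "window", (5.0, 0.0, 0.0)),
]
edges = build_metric_edges(scene)
result = ground_nearest(scene, edges, "chair", "window")
print(result.node_id)  # prints 1
```

A bare query for "chair" matches both chairs equally well; the query "the chair near the window" is resolved only once the edge distances are consulted, which is the kind of ambiguity the abstract describes.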

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings