Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation

Khanh Nguyen; Dasith de Silva Edirimuni; Ghulam Mubashar Hassan; Ajmal Mian

arXiv:2512.19088 · cs.CV · December 23, 2025

TL;DR

This paper introduces a fast, open-vocabulary 3D object retrieval method that uses 2D detectors to generate masks, improving generalization and efficiency over previous approaches that rely on heavy image-based models such as SAM and CLIP.

Contribution

It proposes a novel approach combining 2D open-vocabulary detection with 3D mask generation for improved retrieval of rare objects in 3D scenes.

Findings

01 Significantly reduces inference time compared to SAM- and CLIP-based methods.

02 Improves generalization to infrequent object categories.

03 Enables fast and accurate retrieval of objects from open-ended text queries.

Abstract

Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this…
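
The idea of classifying a class-agnostic 3D mask with 2D detections, as the abstract attributes to Open-YOLO 3D, can be illustrated with a minimal sketch. This is not the authors' or Open-YOLO 3D's implementation. It assumes a single view, points already expressed in the camera frame, a pinhole intrinsic matrix K, and detector output given as axis-aligned boxes with text labels. The function names project_points and label_mask_from_boxes are hypothetical, and the simple point-in-box vote stands in for whatever label assignment the real systems use.

```python
from collections import Counter

import numpy as np


def project_points(points_cam, K):
    """Project Nx3 camera-frame points to pixels with pinhole intrinsics K (3x3).

    Returns (uv, valid): uv is Nx2 (NaN where the point is behind the camera),
    valid is a boolean array marking points with positive depth.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    valid = points_cam[:, 2] > 1e-6
    uvw = (np.asarray(K, dtype=float) @ points_cam.T).T
    uv = np.full((len(points_cam), 2), np.nan)
    uv[valid] = uvw[valid, :2] / uvw[valid, 2:3]
    return uv, valid


def label_mask_from_boxes(points_cam, mask, K, boxes, labels):
    """Assign a label to one 3D instance mask by voting over 2D boxes.

    Illustrative assumption: each box gets one vote per mask point that
    projects inside it, and the label with the most votes wins.
    boxes are (x1, y1, x2, y2) in pixels; labels are the detector's class names.
    Returns None when no mask point falls inside any box.
    """
    uv, valid = project_points(points_cam, K)
    idx = np.flatnonzero(np.asarray(mask, dtype=bool) & valid)
    u, v = uv[idx, 0], uv[idx, 1]
    votes = Counter()
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        inside = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
        votes[label] += int(inside.sum())
    if not votes or max(votes.values()) == 0:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
    points = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [-0.3, -0.3, 1.0]])
    boxes = [(40, 40, 70, 70), (10, 10, 30, 30)]
    labels = ["chair", "lamp"]
    print(label_mask_from_boxes(points, [True, True, False], K, boxes, labels))
```

A practical system would aggregate votes over many views and account for occlusion; the sketch omits both to keep the core projection-and-vote step visible.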

Taxonomy

Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques