Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Mohamed El Amine Boudjoghra; Angela Dai; Jean Lahoud; Hisham Cholakkal; Rao Muhammad Anwer; Salman Khan; Fahad Shahbaz Khan

arXiv:2406.02548 · cs.CV · February 14, 2025 · 3 cites

TL;DR

Open-YOLO 3D is a fast, accurate open-vocabulary 3D instance segmentation method that uses only 2D object detection on multi-view RGB images to associate class-agnostic 3D masks with text prompts, significantly reducing inference time while achieving state-of-the-art performance.

Contribution

The paper proposes an approach that labels 3D instance masks using 2D object detection alone, avoiding the computationally expensive aggregation of SAM and CLIP features from multiple views into 3D.

Findings

01 Achieves up to roughly 16x speedup over the best existing method.

02 Attains 24.7% mAP on the ScanNet200 validation set at 22 seconds per scene.

03 Achieves state-of-the-art performance on the ScanNet200 and Replica benchmarks.

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the…
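The association step the abstract describes (linking class-agnostic 3D masks to text prompts through 2D detections) can be illustrated with a minimal sketch. This is not the official implementation: the function names `label_map_from_boxes`, `project_points`, and `vote_label` are hypothetical, the pinhole camera model and the simple majority vote are assumptions, and occlusion handling is deliberately omitted. It assumes a 2D open-vocabulary detector has already produced integer pixel boxes and class indices for each view.

```python
import numpy as np

def label_map_from_boxes(boxes, labels, height, width):
    """Rasterize 2D detections into a per-pixel label map (-1 = no detection).
    Larger boxes are painted first so that smaller boxes end up on top."""
    label_map = np.full((height, width), -1, dtype=np.int64)
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    for i in np.argsort(areas)[::-1]:
        x0, y0, x1, y1 = boxes[i]
        label_map[y0:y1, x0:x1] = labels[i]
    return label_map

def project_points(points, intrinsics, world_to_cam):
    """Project Nx3 world points with a pinhole camera; returns (pixels, depth)."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]
    depth = cam[:, 2]
    # Avoid dividing by zero; points with non-positive depth are filtered later.
    safe_depth = np.where(depth > 1e-6, depth, 1.0)
    uv = (intrinsics @ cam.T).T[:, :2] / safe_depth[:, None]
    return np.round(uv).astype(np.int64), depth

def vote_label(mask_points, views, num_classes):
    """Assign a class to one 3D mask by majority vote over all views.
    `views` is a list of (label_map, intrinsics, world_to_cam) tuples.
    Assumption: no occlusion test, so every in-frame point casts a vote."""
    votes = np.zeros(num_classes, dtype=np.int64)
    for label_map, intrinsics, world_to_cam in views:
        h, w = label_map.shape
        uv, depth = project_points(mask_points, intrinsics, world_to_cam)
        inside = ((depth > 1e-6)
                  & (uv[:, 0] >= 0) & (uv[:, 0] < w)
                  & (uv[:, 1] >= 0) & (uv[:, 1] < h))
        hits = label_map[uv[inside, 1], uv[inside, 0]]
        hits = hits[hits >= 0]  # drop pixels not covered by any detection
        votes += np.bincount(hits, minlength=num_classes)
    return int(votes.argmax()) if votes.sum() > 0 else -1
```

The point of the sketch is that only box rasterization and point projection are needed per view, with no per-mask feature extraction from a 2D foundation model. A practical version would also need a visibility check (for example against depth maps) so that occluded points do not vote.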

Code & Models

Repositories

aminebdj/openyolo3d (PyTorch, official)

Models

mohamed-boudjoghra/Open-YOLO3D (Hugging Face model)

Taxonomy

Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques · Advanced Neural Network Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training · Segment Anything Model