Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

TL;DR
Open-YOLO 3D introduces a fast, accurate open-vocabulary 3D instance segmentation method that relies solely on 2D object detection, significantly reducing inference time while maintaining state-of-the-art performance.
Contribution
The paper proposes a novel approach that leverages 2D object detection for 3D segmentation, avoiding heavy reliance on computationally expensive multi-view 3D features.
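The idea of labeling 3D instances from 2D detections alone can be illustrated with a small sketch. This is not the authors' implementation: it assumes a pinhole camera model, ignores occlusion, and uses simple majority voting over views. The function names (`project_points`, `vote_label`) and the layout of each view dictionary are hypothetical, chosen only for this example.

```python
import numpy as np

def project_points(points, K, world_to_cam):
    """Project Nx3 world points into an image; returns (pixel coords, depth)."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]
    depth = cam[:, 2]
    # Guard against division by zero; such points are filtered out later.
    safe = np.where(depth > 1e-6, depth, 1.0)
    uv = (K @ cam.T).T[:, :2] / safe[:, None]
    return uv, depth

def vote_label(points, views, num_classes):
    """Assign a class to one 3D mask by voting over 2D boxes in all views.

    Each view is a dict (hypothetical layout) with keys:
      "K": 3x3 intrinsics, "world_to_cam": 4x4 extrinsics,
      "size": (width, height), "detections": [(x1, y1, x2, y2, class_id)].
    Returns -1 when no projected point falls inside any box.
    """
    votes = np.zeros(num_classes, dtype=np.int64)
    for view in views:
        uv, depth = project_points(points, view["K"], view["world_to_cam"])
        w, h = view["size"]
        visible = (
            (depth > 1e-6)
            & (uv[:, 0] >= 0) & (uv[:, 0] < w)
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        )
        labels = np.full(len(points), -1, dtype=np.int64)
        # Paint larger boxes first so that smaller boxes overwrite them.
        boxes = sorted(
            view["detections"],
            key=lambda d: -(d[2] - d[0]) * (d[3] - d[1]),
        )
        for x1, y1, x2, y2, cls in boxes:
            inside = (
                visible
                & (uv[:, 0] >= x1) & (uv[:, 0] <= x2)
                & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
            )
            labels[inside] = cls
        votes += np.bincount(labels[labels >= 0], minlength=num_classes)
    return int(votes.argmax()) if votes.sum() > 0 else -1
```

The sketch only touches box coordinates and point projections, so no per-pixel 2D features have to be lifted into 3D, which is the kind of saving the contribution statement points to. A real pipeline would also need a visibility test (for example, against depth maps) and a source of class-agnostic 3D masks, both of which are outside this sketch.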
Findings
Achieves up to 16x speedup over existing methods.
Attains 24.7% mAP on ScanNet200 with 22 seconds per scene.
Reaches state-of-the-art performance on benchmark datasets.
Abstract
Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the…
Taxonomy
Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques · Advanced Neural Network Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training · Segment Anything Model
