Zoo3D: Zero-Shot 3D Object Detection at Scene Level
Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi

TL;DR
Zoo3D introduces a training-free, zero-shot 3D object detection framework that constructs 3D bounding boxes and assigns semantic labels without prior training, achieving state-of-the-art results on multiple benchmarks.
Contribution
Zoo3D is presented as the first training-free 3D object detection method: it operates in zero-shot mode and also accepts images, both posed and unposed, as input, extending open-vocabulary 3D understanding beyond point clouds.
Findings
Zero-shot Zoo3D$_0$ outperforms existing self-supervised methods.
Both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results.
The method works directly with posed and unposed images.
Abstract
3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo…
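The abstract's idea of building 3D boxes by graph clustering of 2D instance masks can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each 2D mask has already been lifted to a set of 3D point ids in a shared scene point cloud, links two masks when their point sets overlap enough, and takes connected components as object candidates. The function names, the overlap criterion, and the `min_overlap` threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def cluster_masks(mask_points, min_overlap=0.5):
    """Group masks whose lifted 3D point sets overlap (assumed criterion).

    mask_points: list of sets of point ids, one set per 2D instance mask.
    Returns a sorted list of clusters, each a list of mask indices.
    """
    parent = list(range(len(mask_points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add a graph edge when the shared points cover enough of the smaller mask.
    for i, j in combinations(range(len(mask_points)), 2):
        a, b = mask_points[i], mask_points[j]
        smaller = min(len(a), len(b))
        if smaller and len(a & b) / smaller >= min_overlap:
            parent[find(i)] = find(j)

    # Connected components of the mask graph are the object candidates.
    clusters = {}
    for i in range(len(mask_points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def axis_aligned_box(point_ids, coords):
    """Return (min_corner, max_corner) of the points with the given ids."""
    xs, ys, zs = zip(*(coords[p] for p in point_ids))
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def boxes_from_masks(mask_points, coords, min_overlap=0.5):
    """One axis-aligned 3D box per cluster of overlapping masks."""
    boxes = []
    for cluster in cluster_masks(mask_points, min_overlap):
        ids = set().union(*(mask_points[i] for i in cluster))
        if ids:
            boxes.append(axis_aligned_box(ids, coords))
    return boxes
```

Calling `boxes_from_masks(masks, coords)` with masks from several views merges the views of the same object into one box. A real pipeline would still need the mask lifting step (depth and camera poses) and the semantic labeling stage, both of which are outside this sketch.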
Taxonomy
Topics: Advanced Neural Network Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization
