Open Vocabulary Monocular 3D Object Detection
Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng

TL;DR
This paper introduces open-vocabulary monocular 3D object detection, enabling detection of any object category from a single RGB image without relying on costly sensors or limited vocabularies.
Contribution
It presents a framework that leverages pretrained vision models and a new evaluation metric to improve zero-shot and in-domain 3D detection performance.
Findings
Achieves state-of-the-art zero-shot detection results.
Outperforms existing methods on in-domain detection.
Provides a new benchmark and evaluation protocol.
Abstract
We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any category in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed-vocabulary settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (e.g., table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Sparse Evolutionary Training
