Open Vocabulary Monocular 3D Object Detection

Jin Yao; Hao Gu; Xuweiyi Chen; Jiayun Wang; Zezhou Cheng

arXiv:2411.16833 · cs.CV · November 27, 2025

TL;DR

This paper introduces open-vocabulary monocular 3D object detection, enabling detection of any object category from a single RGB image without relying on costly sensors or limited vocabularies.

Contribution

It presents a framework that leverages pretrained vision models and a new evaluation metric to improve zero-shot and in-domain 3D detection performance.

Findings

01

Achieves state-of-the-art zero-shot detection results.

02

Outperforms existing methods on in-domain detection.

03

Provides a new benchmark and evaluation protocol.

Abstract

We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any category in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed-vocabulary settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (e.g., table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach…
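
To make "metric 3D space from a single RGB image" concrete, here is a minimal sketch of the underlying geometry, assuming a pinhole camera with known intrinsics and a metric depth value for the pixel of interest (for example, from a pretrained depth estimator). It illustrates standard back-projection only and is not the paper's framework; the function name, parameter names, and intrinsics values are hypothetical.

```python
from typing import Tuple


def backproject_pixel(
    u: float, v: float, depth_m: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Lift a pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes an ideal pinhole camera (no lens distortion):
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 image (illustrative values only).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# Center of a 2D detection box, paired with an assumed depth of 2 m.
center_3d = backproject_pixel(420.0, 240.0, 2.0, FX, FY, CX, CY)
print(center_3d)
```

A pixel at the principal point maps to a point on the optical axis, and a horizontal offset in the image scales linearly with depth. A full 3D box additionally needs dimensions and orientation, which this sketch does not cover.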

Code & Models

Repositories

UVA-Computer-Vision-Lab/ovmono3d
PyTorch · Official

Taxonomy

Topics: Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: Sparse Evolutionary Training