Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data
Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang

TL;DR
This paper introduces OVM3D-Det, an open-vocabulary monocular 3D object detection framework that trains on RGB images alone. It generates 3D pseudo-labels automatically and then refines them, so neither training nor inference requires LiDAR or other 3D sensor data.
Contribution
The paper presents an RGB-only training approach for open-vocabulary monocular 3D detection that builds pseudo-labels from pseudo-LiDAR and large language model priors, removing the need for expensive 3D data.
Findings
Outperforms baselines in indoor scenarios
Effective label calibration with adaptive pseudo-LiDAR erosion
Enables scalable open-vocabulary 3D detection without 3D sensors
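The "adaptive pseudo-LiDAR erosion" finding can be illustrated with a small sketch. This summary does not describe how the erosion works, so everything here is an assumption: the sketch erodes a 2D object mask before its pixels would be lifted to 3D, and it picks the number of erosion passes from the mask area so that small objects are not wiped out. The names `erode_once`, `adaptive_erode`, `pixels_per_step`, and `max_steps` are hypothetical and not taken from the paper.

```python
from typing import List

Mask = List[List[int]]  # binary object mask, 1 = foreground


def erode_once(mask: Mask) -> Mask:
    """One pass of 4-neighbour binary erosion.

    A pixel survives only if it and its four neighbours are foreground.
    Pixels on the image border are always removed, because their
    out-of-image neighbours are treated as background.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            if (mask[r][c] and mask[r - 1][c] and mask[r + 1][c]
                    and mask[r][c - 1] and mask[r][c + 1]):
                out[r][c] = 1
    return out


def adaptive_erode(mask: Mask, pixels_per_step: int = 16,
                   max_steps: int = 3) -> Mask:
    """Erode larger masks more aggressively; leave small masks untouched.

    The area-based rule below is an illustrative assumption, not the
    rule used by OVM3D-Det.
    """
    area = sum(map(sum, mask))
    steps = min(max_steps, area // pixels_per_step)
    for _ in range(steps):
        mask = erode_once(mask)
    return mask


big = [[1] * 5 for _ in range(5)]    # 25 pixels: one erosion pass
small = [[1] * 3 for _ in range(3)]  # 9 pixels: no erosion
eroded_big = adaptive_erode(big)
eroded_small = adaptive_erode(small)
```

Shrinking the mask this way discards pixels near the object boundary, where depth estimates tend to mix foreground and background, at the cost of fewer points per object.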
Abstract
Open-vocabulary 3D object detection, which aims to recognize novel classes in previously unseen domains, has recently attracted considerable attention due to its broad applications in autonomous driving and robotics. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data, either as input or for generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However,…
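Pseudo-LiDAR conventionally means a point cloud obtained by back-projecting a per-pixel depth map through the pinhole camera model. The minimal sketch below shows that generic step only; the depth source, the `Intrinsics` container, and the function name `depth_to_pseudo_lidar` are assumptions for illustration, and the excerpt above does not specify how OVM3D-Det implements it.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in camera coordinates


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters (hypothetical container)."""
    fx: float  # focal length in pixels, x axis
    fy: float  # focal length in pixels, y axis
    cx: float  # principal point, x
    cy: float  # principal point, y


def depth_to_pseudo_lidar(depth: List[List[float]],
                          cam: Intrinsics) -> List[Point]:
    """Back-project every pixel with a positive depth into 3D.

    For pixel (u, v) with depth z:
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip pixels without a valid depth
                continue
            x = (u - cam.cx) * z / cam.fx
            y = (v - cam.cy) * z / cam.fy
            points.append((x, y, z))
    return points


# Toy 2x3 depth map in metres; 0.0 marks a missing measurement.
cam = Intrinsics(fx=2.0, fy=2.0, cx=1.0, cy=1.0)
depth = [[0.0, 4.0, 4.0],
         [4.0, 4.0, 4.0]]
cloud = depth_to_pseudo_lidar(depth, cam)
```

Restricting the loop to pixels inside a 2D detection's mask would give the points belonging to one object, from which a 3D box could then be fitted.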
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Image Processing and 3D Reconstruction
Methods: Softmax · Attention Is All You Need
