Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

Rui Huang; Henry Zheng; Yan Wang; Zhuofan Xia; Marco Pavone; Gao Huang

arXiv:2411.15657·cs.CV·November 26, 2024

TL;DR

This paper introduces OVM3D-Det, a cost-effective monocular 3D object detection framework that trains solely on RGB images using pseudo-labels and innovative label refinement techniques, enabling open-vocabulary detection without 3D sensors.

Contribution

It presents a novel RGB-only training approach for open-vocabulary 3D detection using pseudo-LiDAR and large language model priors, eliminating the need for expensive 3D data.

Findings

01. Outperforms baselines in indoor scenarios

02. Effective label calibration with adaptive pseudo-LiDAR erosion

03. Enables scalable open-vocabulary 3D detection without 3D sensors

Abstract

Open-vocabulary 3D object detection, which aims to recognize novel classes in previously unseen domains, has recently attracted considerable attention due to its broad applications in autonomous driving and robotics. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data, either as input or for generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However,…
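
The pseudo-LiDAR mentioned in the abstract is commonly built by back-projecting a per-pixel depth map into 3D through a pinhole camera model. The sketch below illustrates that general idea only; it is not the authors' implementation. The function name `depth_to_pseudo_lidar`, the intrinsics parameters `fx`, `fy`, `cx`, `cy`, and the assumption that the depth map is metric and comes from some monocular depth estimator are all hypothetical choices made for illustration.

```python
import numpy as np


def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels. Points are in camera coordinates
    (x right, y down, z forward). Pixels with non-positive or
    non-finite depth are dropped.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid]


if __name__ == "__main__":
    # Toy 2x3 depth map at a constant 2 m, with one invalid pixel.
    depth = np.full((2, 3), 2.0)
    depth[1, 2] = 0.0
    cloud = depth_to_pseudo_lidar(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
    print(cloud.shape)
```

In a labeling pipeline of the kind the abstract describes, the points falling inside a 2D detection or mask would then be the candidates for fitting a 3D box; how OVM3D-Det does that fitting and refinement is not covered by this sketch.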

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Image Processing and 3D Reconstruction

Methods: Softmax · Attention Is All You Need