Delving into the Pre-training Paradigm of Monocular 3D Object Detection
Zhuoling Li, Chuanrui Zhang, En Yu, Haoqian Wang

TL;DR
This paper investigates pre-training strategies for monocular 3D object detection, proposing methods that leverage unlabeled data to significantly improve detection accuracy on KITTI-3D and nuScenes benchmarks.
Contribution
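The paper introduces a pre-training framework for monocular 3D object detection (M3OD) that combines depth estimation and 2D object detection, refined with target-guided semi-dense depth estimation, keypoint-aware 2D object detection, and class-level loss adjustment.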
Findings
Pre-training with the proposed methods boosts the Car AP3D70 score by 18.71% on KITTI-3D.
The approach improves the nuScenes NDS score by a relative 40.41%.
The framework effectively leverages unlabeled data to enhance 3D detection performance.
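The difference between a relative and an absolute gain matters when reading figures like these. A minimal sketch of the arithmetic follows, using made-up scores rather than numbers from the paper; the function name `relative_improvement` is also hypothetical.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Illustrative values only; these are NOT results reported in the paper.
baseline_score = 0.20
improved_score = 0.25
absolute_gain = improved_score - baseline_score  # about 0.05 points
relative_gain = relative_improvement(baseline_score, improved_score)  # about 25% of the baseline
```

A relative gain is always measured against the baseline, so the same absolute gain looks larger when the baseline score is low.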
Abstract
Labels for monocular 3D object detection (M3OD) are expensive to obtain. Meanwhile, abundant unlabeled data usually exists in practical applications, and pre-training is an efficient way of exploiting the knowledge in unlabeled data. However, the pre-training paradigm for M3OD has hardly been studied. We aim to bridge this gap in this work. To this end, we first draw two observations: (1) The guideline for devising pre-training tasks is to imitate the representation of the target task. (2) Combining depth estimation and 2D object detection is a promising M3OD pre-training baseline. Afterwards, following the guideline, we propose several strategies to further improve this baseline, which mainly include target-guided semi-dense depth estimation, keypoint-aware 2D object detection, and class-level loss adjustment. Combining all the developed techniques, the obtained pre-training framework…
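One way to picture the baseline in observation (2) is as a weighted sum of a depth loss and a 2D detection loss, with the depth loss computed only where depth labels exist. The sketch below is an illustration under assumptions, not the paper's implementation: the masked L1 form, the function names, and the weights `w_depth` and `w_det` are all hypothetical.

```python
from typing import Optional, Sequence


def masked_l1_depth_loss(pred: Sequence[float],
                         target: Sequence[Optional[float]]) -> float:
    """Mean absolute depth error over pixels that have a label (None = unlabeled)."""
    errors = [abs(p - t) for p, t in zip(pred, target) if t is not None]
    return sum(errors) / len(errors) if errors else 0.0


def pretraining_loss(depth_loss: float, det_loss: float,
                     w_depth: float = 1.0, w_det: float = 1.0) -> float:
    """Weighted sum of the two pre-training task losses (weights are assumed)."""
    return w_depth * depth_loss + w_det * det_loss


# Toy flattened depth map with two unlabeled pixels (hypothetical values).
pred_depth = [10.0, 12.0, 30.0, 8.0]
gt_depth = [11.0, None, 28.0, None]
depth_term = masked_l1_depth_loss(pred_depth, gt_depth)  # (1.0 + 2.0) / 2 = 1.5
total = pretraining_loss(depth_term, det_loss=0.5)       # 1.5 + 0.5 = 2.0
```

Masking out unlabeled pixels is a common way to train on depth maps that are not fully dense; how the paper selects labeled pixels is not stated in the text above.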
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
