TL;DR
This paper introduces DD3D, an end-to-end, single-stage monocular 3D object detector that benefits from depth pre-training without the limitations of pseudo-lidar methods. It reports 16.34% AP for Cars and 9.28% AP for Pedestrians on KITTI-3D, and 41.5% mAP on NuScenes.
Contribution
The authors propose a single-stage, end-to-end monocular 3D detection architecture designed for information transfer between depth estimation and 3D detection, so it can use depth pre-training without the two-stage pipeline of pseudo-lidar methods.
Findings
Achieves 16.34% AP for Cars on KITTI-3D
Achieves 9.28% AP for Pedestrians on KITTI-3D
Attains 41.5% mAP on NuScenes
Abstract
Recent progress in 3D object detection from single images leverages monocular depth estimation as a way to produce 3D pointclouds, turning cameras into pseudo-lidar sensors. These two-stage detectors improve with the accuracy of the intermediate depth estimation network, which can itself be improved without manual labels via large-scale self-supervised learning. However, they tend to suffer from overfitting more than end-to-end methods, are more complex, and the gap with similar lidar-based detectors remains significant. In this work, we propose an end-to-end, single stage, monocular 3D object detector, DD3D, that can benefit from depth pre-training like pseudo-lidar methods, but without their limitations. Our architecture is designed for effective information transfer between depth estimation and 3D detection, allowing us to scale with the amount of unlabeled pre-training data. Our…
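For context, the pseudo-lidar step that the abstract contrasts DD3D against can be sketched as a back-projection of a predicted depth map into a camera-frame point cloud under a standard pinhole model. This is a minimal illustration, not code from the paper: the function name `backproject_depth`, the toy depth values, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions chosen for the example.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a dense depth map to 3D camera-frame points (pinhole model).

    depth: list of rows, where depth[v][u] is the depth along the optical axis.
    fx, fy, cx, cy: assumed pinhole intrinsics (focal lengths, principal point).
    Returns a list of (x, y, z) tuples; pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # treat non-positive depth as "no prediction"
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map and made-up intrinsics (illustrative values only).
depth = [[10.0, 10.0],
         [0.0, 20.0]]
cloud = backproject_depth(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)
```

In a two-stage pseudo-lidar pipeline, a point cloud like `cloud` would then be passed to a lidar-style 3D detector. DD3D, as described in the abstract, avoids this intermediate representation and detects objects end to end from the image.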
