Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
L. Koestler, N. Yang, R. Wang, D. Cremers

TL;DR
This paper introduces a novel approach for monocular 3D vehicle detection that learns without 3D bounding box labels by using shape representations and differentiable rendering, reducing labeling effort.
Contribution
It presents a new network architecture and training method that eliminate the need for 3D bounding box labels in monocular 3D detection tasks.
Findings
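Achieves promising results on the real-world KITTI dataset compared with state-of-the-art methods that require 3D bounding box labels for training.
Outperforms conventional baseline methods.
Shows that loss functions built on differentiable shape rendering (depth maps, segmentation masks, and ego- and object-motion) can supervise 3D detection in place of 3D bounding box labels.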
Abstract
The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision that have to be generated by hand-labeling. We propose a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels. By representing the objects as triangular meshes and employing differentiable shape rendering, we define loss functions based on depth maps, segmentation masks, and ego- and object-motion, which are generated by pre-trained, off-the-shelf networks. We evaluate the proposed algorithm on the real-world KITTI dataset and achieve promising performance in comparison to state-of-the-art methods requiring 3D bounding box labels for training and superior performance to conventional baseline methods.
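To make the idea of rendering-based supervision concrete, the following is a minimal sketch of how a rendered vehicle shape could be compared against the outputs of pre-trained networks. It is not the paper's implementation: the function names, the soft-IoU and mean-absolute-error forms, and the weights `w_mask` and `w_depth` are all assumptions chosen for illustration, and the motion-based term is omitted. NumPy is used only to keep the example self-contained; an actual training setup would compute these terms in an automatic-differentiation framework so that gradients flow back through the renderer.

```python
import numpy as np

def silhouette_loss(rendered_mask, target_mask, eps=1e-6):
    """1 - soft IoU between a rendered silhouette and a segmentation mask.

    Both inputs are HxW arrays with values in [0, 1] (hypothetical format).
    """
    inter = np.sum(rendered_mask * target_mask)
    union = np.sum(rendered_mask + target_mask - rendered_mask * target_mask)
    return float(1.0 - inter / (union + eps))

def depth_loss(rendered_depth, target_depth, mask):
    """Mean absolute depth difference, averaged over pixels weighted by mask."""
    weight = np.sum(mask)
    if weight == 0:
        return 0.0  # no overlapping pixels, so no depth evidence
    return float(np.sum(mask * np.abs(rendered_depth - target_depth)) / weight)

def total_loss(rendered_mask, rendered_depth, target_mask, target_depth,
               w_mask=1.0, w_depth=1.0):
    """Weighted sum of the two illustrative terms (weights are assumptions)."""
    overlap = rendered_mask * target_mask  # compare depth only where both agree
    return (w_mask * silhouette_loss(rendered_mask, target_mask)
            + w_depth * depth_loss(rendered_depth, target_depth, overlap))

if __name__ == "__main__":
    # Toy 4x4 example: the "rendered" shape is shifted one pixel to the right
    # relative to the mask from a (hypothetical) segmentation network.
    target_mask = np.zeros((4, 4)); target_mask[1:3, 0:2] = 1.0
    rendered_mask = np.zeros((4, 4)); rendered_mask[1:3, 1:3] = 1.0
    target_depth = np.full((4, 4), 10.0)
    rendered_depth = np.full((4, 4), 12.0)
    print(total_loss(rendered_mask, rendered_depth, target_mask, target_depth))
```

In this sketch the loss falls to roughly zero when the rendered shape matches both the mask and the depth map, which is the signal that would drive a pose and shape estimate toward agreement with the image evidence.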
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis
