TL;DR
This paper introduces a novel framework for crowdsourced 3D mapping that estimates the positions of traffic signs without prior knowledge of camera intrinsics, combining multi-view geometry and self-supervised learning.
Contribution
It presents a new method that jointly estimates camera parameters and 3D landmarks from monocular images and GPS, without needing pre-calibrated cameras.
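As a rough illustration of the multi-view geometry side only, the sketch below triangulates a single landmark from two views using linear (DLT) triangulation. It is not the paper's implementation: the intrinsic matrix `K`, the camera poses, and the sign position are made-up values, and the intrinsics are assumed to be already known (in the paper they are estimated rather than given). The helper names `projection_matrix`, `triangulate`, and `project` are hypothetical.

```python
import numpy as np


def projection_matrix(K, R, t):
    """Build a 3x4 projection matrix P = K [R | t]."""
    return K @ np.hstack([R, t.reshape(3, 1)])


def project(P, X):
    """Project a 3D point X (length 3) to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate(points_2d, projections):
    """Linear (DLT) triangulation of one 3D point from N >= 2 views."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        # Each observation gives two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


# Assumed (illustrative) intrinsics; not values from the paper.
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)

# First camera at the origin; second camera 2 m further along the optical axis.
P1 = projection_matrix(K, R, np.zeros(3))
P2 = projection_matrix(K, R, np.array([0.0, 0.0, -2.0]))

# Hypothetical traffic sign position in the first camera's frame (metres).
sign = np.array([3.0, -1.5, 20.0])

observations = [project(P1, sign), project(P2, sign)]
estimate = triangulate(observations, [P1, P2])
print(estimate)
```

With noise-free observations the estimate matches the simulated sign position up to numerical precision. With real detections, the result depends on pixel noise, the baseline between frames, and the accuracy of the intrinsics and poses.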
Findings
Achieved 39 cm average relative positioning accuracy
Achieved 1.26 m absolute positioning accuracy
Constructed a new KITTI-based traffic sign dataset
Abstract
The ability to efficiently utilize crowdsourced visual data carries immense potential for the domains of large-scale dynamic mapping and autonomous driving. However, state-of-the-art methods for crowdsourced 3D mapping assume prior knowledge of camera intrinsics. In this work, we propose a framework that estimates the 3D positions of semantically meaningful landmarks such as traffic signs without assuming known camera intrinsics, using only a monocular color camera and GPS. We utilize multi-view geometry as well as deep-learning-based self-calibration, depth, and ego-motion estimation for traffic sign positioning, and show that combining their strengths is important for increasing the map coverage. To facilitate research on this task, we construct and make available a KITTI-based 3D traffic sign ground truth positioning dataset. Using our proposed framework, we achieve an average…
