TL;DR
This paper introduces a large-scale dataset and a joint embedding method for cross-modal visual localization, which matches ground RGB images to aerial LIDAR data rendered as depth images. The method reaches a median rank of 5 on a test set of 50K location pairs.
Contribution
It presents the first large-scale dataset for this task and a new joint embedding approach that combines appearance and semantic cues to localize ground RGB images against aerial LIDAR depth images.
Findings
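Achieved a median rank of 5 in cross-modal matching on a test set of 50K location pairs from a 14 km^2 area
Created a dataset with over 550K pairs of RGB and aerial LIDAR depth images covering 143 km^2
Reported a significant advancement over prior works, which were demonstrated only on small datasets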
Abstract
We study an important yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint-embedding-based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works…
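The median-rank figure quoted in the abstract can be made concrete with a minimal sketch. It assumes, hypothetically, that each ground RGB query and each aerial LIDAR depth image has already been mapped to a vector in the joint embedding space, that query i matches reference i, and that similarity is cosine similarity. The function name `median_rank` and these conventions are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def median_rank(query_emb, ref_emb):
    """Median 1-based rank of the true match for each query.

    Assumption (not from the paper): row i of query_emb (RGB) and row i of
    ref_emb (LIDAR depth) describe the same location.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)

    # sim[i, j] = similarity between query i and reference j.
    # For very large test sets this matrix would need to be built in chunks.
    sim = q @ r.T

    # Similarity of each query to its own ground-truth reference.
    true_sim = np.diag(sim)

    # Rank = 1 + number of references scoring strictly higher than the true match.
    ranks = 1 + (sim > true_sim[:, None]).sum(axis=1)
    return float(np.median(ranks))

if __name__ == "__main__":
    # Synthetic stand-in data: references are noisy copies of the queries.
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(1000, 64))
    references = queries + rng.normal(size=(1000, 64))
    print("median rank:", median_rank(queries, references))
```

Under this definition, a median rank of 5 on 50K candidates means that for half of the queries the correct depth image falls within the top 5 retrieved results, that is, within the top 0.01% of the candidate set.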
