Monocular Visual Place Recognition in LiDAR Maps via Cross-Modal State Space Model and Multi-View Matching
Gongxin Yao, Xinyang Li, Luowei Fu, Yu Pan

TL;DR
This paper presents a novel cross-modal descriptor learning framework for monocular camera localization within LiDAR maps, leveraging multi-view matching and contrastive learning to improve place recognition accuracy.
Contribution
It introduces a new multi-view, cross-modal descriptor learning approach using a visual state space model and contrastive training for improved LiDAR map-based localization.
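As a rough illustration of what cross-modal contrastive training means here, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired image and point-cloud descriptors. It is a generic formulation, not the loss used in the paper: the function names, the NumPy implementation, and the temperature value of 0.07 are all assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(image_desc, cloud_desc, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired descriptors.

    image_desc, cloud_desc: arrays of shape (B, D); row i of each is assumed
    to describe the same place. The temperature is an illustrative default.
    """
    img = l2_normalize(image_desc)
    pc = l2_normalize(cloud_desc)
    logits = img @ pc.T / temperature  # (B, B) pairwise similarities

    def diagonal_cross_entropy(l):
        # Numerically stable log-softmax over each row, then pick the
        # matching (diagonal) entry as the positive.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-cloud and cloud-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

The loss is small when each image descriptor is closest to its own point-cloud descriptor and grows when the pairing is broken.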
Findings
Effective on the KITTI datasets
Generalizes well across different scenes
Potentially reduces computational overhead relative to visual SLAM by skipping the simultaneous mapping step
Abstract
Achieving monocular camera localization within pre-built LiDAR maps can bypass the simultaneous mapping process of visual SLAM systems, potentially reducing the computational overhead of autonomous localization. To this end, one of the key challenges is cross-modal place recognition, which involves retrieving 3D scenes (point clouds) from a LiDAR map according to online RGB images. In this paper, we introduce an efficient framework to learn descriptors for both RGB images and point clouds. It takes the visual state space model (VMamba) as the backbone and employs a pixel-view-scene joint training strategy for cross-modal contrastive learning. To address the field-of-view differences, independent descriptors are generated from multiple evenly distributed viewpoints for point clouds. A visible 3D points overlap strategy is then designed to quantify the similarity between point cloud views and…
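The retrieval step described above can be sketched as a nearest-neighbor search in which each map point cloud contributes several per-view descriptors. The sketch assumes the descriptors are already computed and scores each place by its best-matching view using cosine similarity. That max-over-views rule, the function name `retrieve_place`, and the array shapes are assumptions for illustration. The visible 3D points overlap strategy from the abstract is not reproduced here.

```python
import numpy as np

def retrieve_place(query_desc, map_view_descs):
    """Return (place_index, view_index, score) of the best match.

    query_desc: (D,) descriptor of the online RGB image.
    map_view_descs: (N, V, D) descriptors for N map point clouds, each
    rendered from V viewpoints (hypothetical layout for this example).
    """
    q = query_desc / np.linalg.norm(query_desc)
    m = map_view_descs / np.linalg.norm(map_view_descs, axis=-1, keepdims=True)
    sims = m @ q                       # (N, V) cosine similarities
    best_view = sims.argmax(axis=1)    # best viewpoint per place
    place_scores = sims.max(axis=1)    # place score = its best view
    place = int(place_scores.argmax())
    return place, int(best_view[place]), float(place_scores[place])

# Toy usage: 5 places, 4 views each, 32-dimensional descriptors.
rng = np.random.default_rng(0)
map_descs = rng.normal(size=(5, 4, 32))
query = map_descs[3, 2] + 0.01 * rng.normal(size=32)  # noisy copy of one view
place, view, score = retrieve_place(query, map_descs)
```

Keeping one descriptor per viewpoint lets a narrow field-of-view camera image match the part of a 360-degree scan it actually observes, at the cost of V times more descriptors to search.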
Taxonomy
Topics: Remote Sensing and Land Use · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques
