MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition
Zhengyi Xu, Yuhang Ming, Zhihao Zhan, Hanyu Zhu, Javier Civera, Wanzeng Kong

TL;DR
MAG-VLAQ introduces a foundation-model-enhanced framework for multi-modal aerial-ground place recognition, effectively aligning heterogeneous visual and geometric data to improve cross-view matching accuracy.
Contribution
The paper proposes ODE-conditioned VLAQ, a novel fusion method that dynamically adapts query centers based on fused multi-modal information, enhancing retrieval performance.
Findings
Reaches 61.1% Recall@1 on KITTI360-AG, nearly doubling the previous state of the art.
Effectively aligns multi-modal data for improved cross-view recognition.
Validated on KITTI360-AG and nuScenes-AG datasets.
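For readers unfamiliar with the Recall@1 figure quoted above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval-based place recognition. It is not the authors' evaluation code. The function name `recall_at_k`, the toy query and reference IDs, and the rule for deciding which aerial references count as true matches are all hypothetical; the benchmark's actual positive criterion is not stated in this summary.

```python
def recall_at_k(ranked_ids, positives, k=1):
    """Fraction of queries with at least one true match among the top-k results.

    ranked_ids: dict mapping query id -> list of reference ids, best match first
    positives:  dict mapping query id -> set of reference ids counted as correct
    """
    if not ranked_ids:
        raise ValueError("ranked_ids must contain at least one query")
    hits = 0
    for query, ranking in ranked_ids.items():
        # A query is a hit if any of its top-k retrieved references is a true match.
        if any(ref in positives[query] for ref in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Hypothetical toy data: three ground queries ranked against aerial references.
ranked = {
    "q0": ["a3", "a1", "a7"],
    "q1": ["a2", "a5", "a0"],
    "q2": ["a9", "a4", "a6"],
}
positives = {"q0": {"a3"}, "q1": {"a5"}, "q2": {"a8"}}

r1 = recall_at_k(ranked, positives, k=1)  # only q0 is matched at rank 1
r2 = recall_at_k(ranked, positives, k=2)  # q0 and q1 are matched within rank 2
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

Under this definition, a Recall@1 of 61.1% would mean that for roughly six in ten ground queries the single top-ranked aerial reference is a correct match.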
Abstract
Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated…
