MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition

Zhengyi Xu; Yuhang Ming; Zhihao Zhan; Hanyu Zhu; Javier Civera; Wanzeng Kong

arXiv:2605.09418·cs.CV·May 12, 2026

TL;DR

MAG-VLAQ introduces a foundation-model-enhanced framework for multi-modal aerial-ground place recognition, effectively aligning heterogeneous visual and geometric data to improve cross-view matching accuracy.

Contribution

The paper proposes ODE-conditioned VLAQ, a fusion method that dynamically adapts query centers according to the fused RGB-LiDAR information, which improves retrieval performance.

Findings

01

Reaches 61.1% Recall@1 on KITTI360-AG, nearly doubling the previous state of the art.

02

Aligns heterogeneous visual and geometric tokens in a shared embedding space, improving cross-view recognition.

03

Validated on KITTI360-AG and nuScenes-AG datasets.
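
For readers unfamiliar with the Recall@1 figure quoted in the findings, the following is a minimal sketch of how Recall@K is commonly computed in place recognition. It is not the paper's evaluation code. The function name `recall_at_k`, the cosine-similarity ranking, and the `positives` format (a list of correct reference indices per query) are all assumptions made for illustration.

```python
import numpy as np

def recall_at_k(query_desc, ref_desc, positives, k=1):
    """Fraction of queries with at least one true match among their top-k references.

    query_desc: (Q, D) array of ground-query descriptors (hypothetical input).
    ref_desc:   (R, D) array of aerial-reference descriptors (hypothetical input).
    positives:  list of length Q; positives[i] holds the correct reference indices.
    """
    # L2-normalise so the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    sims = q @ r.T                               # (Q, R) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k best references
    hits = [len(set(topk[i].tolist()) & set(positives[i])) > 0
            for i in range(len(q))]
    return float(np.mean(hits))
```

Under this definition, Recall@1 counts a query as correct only when its single best-ranked reference is a true match, which is why it is the strictest of the usual Recall@K settings.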

Abstract

Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated…
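
The pipeline outlined in the abstract (project heterogeneous tokens into a shared space, fuse them, and aggregate around query centers) can be illustrated with a toy sketch. This is not the authors' implementation, and every detail is an assumption: random linear projections stand in for learned ones, `condition_queries` uses a simple Euler integration of hand-picked attention dynamics as a stand-in for a neural ODE, and `aggregate` uses a generic VLAD-style soft-assignment residual sum rather than the paper's aggregation, which the truncated abstract does not specify.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project(tokens, W):
    """Linear projection into the shared space, then per-token L2 normalisation."""
    z = tokens @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def condition_queries(queries, fused, steps=4, dt=0.25):
    """Euler steps of dq/dt = attention(q, fused) - q (illustrative dynamics only)."""
    q = queries.copy()
    for _ in range(steps):
        attn = softmax(q @ fused.T, axis=1)   # (K, N) attention of queries over tokens
        q = q + dt * (attn @ fused - q)       # move each query toward its attended mean
    return q

def aggregate(tokens, queries):
    """Soft-assign tokens to query centers and sum residuals (generic VLAD-style)."""
    assign = softmax(tokens @ queries.T, axis=1)            # (N, K)
    residuals = tokens[:, None, :] - queries[None, :, :]    # (N, K, D)
    desc = (assign[:, :, None] * residuals).sum(axis=0).reshape(-1)
    return desc / np.linalg.norm(desc)                      # global descriptor, unit norm

# Toy inputs: 50 image tokens (dim 32) and 40 LiDAR tokens (dim 16), all hypothetical.
rng = np.random.default_rng(0)
img_tokens = rng.normal(size=(50, 32))
pts_tokens = rng.normal(size=(40, 16))
W_img = rng.normal(size=(32, 8))              # assumed projection to an 8-dim shared space
W_pts = rng.normal(size=(16, 8))
fused = np.concatenate([project(img_tokens, W_img), project(pts_tokens, W_pts)])
queries = rng.normal(size=(4, 8))             # 4 query centers, hypothetical count
desc = aggregate(fused, condition_queries(queries, fused))
```

The point of the sketch is the data flow: the query centers are no longer fixed parameters but are moved by the fused tokens before aggregation, so the final descriptor depends on the multi-modal content of each observation.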
