Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features
Xingtao Ling, Yingying Zhu

TL;DR
This paper introduces a novel dual attention approach with cross-view interaction and multi-scale spatial features to improve cross-view object geo-localization, effectively reducing noise and enhancing localization accuracy.
Contribution
The paper proposes the CVCAM and MHSAM modules for better cross-view feature interaction and multi-scale spatial feature extraction, along with a new G2D dataset for ground-to-drone localization.
Findings
Achieves state-of-the-art localization accuracy on CVOGL and G2D datasets.
Effectively suppresses irrelevant edge noise in spatial relationship maps.
Demonstrates the effectiveness of multi-scale spatial features in improving localization.
Abstract
Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper…
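The repeated interaction between the two views that the abstract attributes to CVCAM can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the token shapes, the single attention head, the residual updates, and the choice of three iterations are all assumptions, and the learned projections, normalization, and positional encodings of a real transformer block are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention: `queries` attend to `context`.

    Hypothetical helper: no learned Q/K/V projections are applied.
    Returns the attended features and the attention weights.
    """
    dim = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


def iterative_cross_view_interaction(query_tokens, reference_tokens, num_iters=3):
    """Exchange context between two views for `num_iters` rounds (assumed value)."""
    q, r = query_tokens, reference_tokens
    for _ in range(num_iters):
        # Both updates are computed from the previous round's features.
        q_update, _ = cross_attention(q, r)  # query view reads the reference view
        r_update, _ = cross_attention(r, q)  # reference view reads the query view
        q = q + q_update  # residual connection (assumption)
        r = r + r_update
    return q, r


# Toy inputs: 4 query-view tokens and 16 reference-view tokens, 8 channels each.
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(4, 8))
reference_tokens = rng.normal(size=(16, 8))

q_out, r_out = iterative_cross_view_interaction(query_tokens, reference_tokens)
print(q_out.shape, r_out.shape)
```

Each round lets one view's tokens attend over the other view's tokens, so context about the query object moves in both directions. A full module would use learned projections per head and would normally keep feature magnitudes in check with layer normalization.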
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper is well-written. 2. The method achieves state-of-the-art performance with lower computational cost.
1. The method employs a cross-attention transformer (CVCAM) to extract a fused feature map from two views, followed by a UNet (MHSAM) for capturing local information. This approach lacks novelty. 2. The dataset contribution is unclear. It seems the images are sourced from the original CVOGL dataset. Have you merely created pair lists, or have you added any additional labels to enhance the dataset?
The issue raised in the article is critical, as the focus on edge noise is unrelated to localization, and the G2D data is crucial for achieving accurate and applicable geo-localization. The overall style of the article is simple and clear, making it very accessible for readers who are unfamiliar with geo-localization. The proposed algorithm achieves state-of-the-art accuracy on two cross-view datasets and demonstrates strong adaptability across different perspectives, including G2S, D2G, and G2D
Sections 3.1 and 3.2 of the article focus on the implementation details and operational processes of the two main algorithm modules, specifically regarding commonly used multi-head attention and positional encoding modules. However, they do not explain how CVCAM effectively reduces the attention on irrelevant edge noise, as mentioned in the introduction, nor do they clarify how MHSAM further processes the spatial relationships. The principles behind how these two modules address the identified p
1. The experiments demonstrate that AttenGeo achieves a high level of localization accuracy, significantly surpassing state-of-the-art methods, which suggests the effectiveness of the proposed approach. 2. The introduction of the G2D dataset is a valuable contribution to the cross-view object geo-localization research community, and it can facilitate further developments in the field.
1. The explanation of the model’s architecture could be clearer. For instance, in Section 3.2, the authors mention that the query and reference features are passed through two cross-attention blocks. However, only images from a single view are fed into this part of the model, so why is it referred to as cross-attention rather than self-attention? 2. The novelty of this work appears limited. The use of cross-view attention and the implementation of cross-attention and spatial attention modules are
1. The paper is well-structured and organized, presenting technical content in an accessible, easy-to-follow manner. Each section flows logically, allowing readers to understand the proposed approach's motivation and functionality. 2. Introducing the G2D dataset for Ground-to-Drone localization addresses a notable gap in existing resources, enabling new research into localization tasks across ground and aerial views. This dataset is a valuable addition, supporting further exploration and benchm
1. The paper leans heavily towards an engineering approach, with its primary contributions being the two attention-based modules, CVCAM and MHSAM, which resemble components commonly seen in existing research. While these modules are well-integrated, the paper lacks a deeper level of novelty, as it does not provide mathematical proofs or theoretical insights to support the unique effectiveness of these modules. 2. The evaluation is limited in scope, as it relies on only two datasets (CVOGL and G
The paper is generally well-written and organized. The experimental results seem strong.
- The core components (CVCAM and MHSAM) are largely based on existing techniques like Transformer's cross-attention and basic convolution operations. A clearer explanation of the technical innovations beyond existing methods, with a theoretical analysis of why the proposed approach works better, would help. - How can the motivation for using multiple iterations of cross-attention be justified? How sensitive is the performance to the architecture choices? - Analyze what types of cases show improvement vs
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
