Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization
Shuhan Hu, Yiru Li, Yuanyuan Li, Yingying Zhu

TL;DR
This paper introduces EDGeo, a novel framework for cross-view object geo-localization that uses mask-based positional encoding and strip convolutional context modeling to improve accuracy and robustness in challenging scenarios.
Contribution
The paper proposes a mask-based positional encoding scheme and a strip-convolution context module. Together they replace keypoint-based encoding, which captures only 2D coordinates, so that the model also uses object shape and surrounding context.
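To make the contrast between the two kinds of positional encoding concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The helper names (`keypoint_map`, `mask_map`, `iou`, `extent`), the filled rectangle standing in for a segmentation mask, and the assumption that the mask stays the same wherever the annotator clicks are all illustrative choices, not details taken from the source.

```python
def keypoint_map(h, w, point):
    """One-hot h x w map marking a single annotated (row, col) click."""
    r, c = point
    return [[1.0 if (i, j) == (r, c) else 0.0 for j in range(w)] for i in range(h)]


def mask_map(h, w, box):
    """Binary silhouette map. A filled rectangle (top, left, bottom, right)
    stands in for a real segmentation mask (hypothetical simplification)."""
    top, left, bottom, right = box
    return [[1.0 if top <= i < bottom and left <= j < right else 0.0
             for j in range(w)] for i in range(h)]


def iou(a, b):
    """Intersection-over-union of two binary maps of equal size."""
    inter = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(max(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0


def extent(m):
    """(height, width) of the bounding box around the nonzero cells."""
    rows = [i for i, row in enumerate(m) if any(row)]
    cols = [j for j in range(len(m[0])) if any(row[j] for row in m)]
    return (rows[-1] - rows[0] + 1, cols[-1] - cols[0] + 1)


H, W = 8, 8
building = (2, 1, 5, 7)             # hypothetical object: rows 2-4, cols 1-6
click_a, click_b = (3, 3), (3, 4)   # two annotations one pixel apart

# The keypoint encodings of the two clicks do not overlap at all.
kp_iou = iou(keypoint_map(H, W, click_a), keypoint_map(H, W, click_b))

# Assumption: both clicks resolve to the same object mask, so the
# mask encoding is unchanged by the one-pixel annotation shift.
mask_iou = iou(mask_map(H, W, building), mask_map(H, W, building))

kp_extent = extent(keypoint_map(H, W, click_a))   # a point has no shape
mask_extent = extent(mask_map(H, W, building))    # the mask carries the silhouette size

print(kp_iou, mask_iou, kp_extent, mask_extent)
```

In this toy setting the keypoint map changes completely under a one-pixel shift and carries no shape, while the mask map is stable and exposes the object's height and width. That is the intuition behind moving from "location-aware" to "object-aware" encoding.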
Findings
Achieves state-of-the-art localization accuracy on public datasets.
Improves robustness to annotation shifts and large-span objects.
Enhances feature discrimination with strip convolutional kernels.
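The idea behind strip kernels can be illustrated with a small plain-Python sketch. It is not the paper's module: the function name `strip_conv`, the uniform kernel weights, zero padding, and the kernel length `K = 7` are all assumptions made for illustration, whereas a real model would learn the weights.

```python
def strip_conv(image, k, axis):
    """Convolve a 2D grid with a uniform strip kernel and zero padding.

    axis=1 applies a 1 x k horizontal strip, axis=0 a k x 1 vertical strip.
    Uniform weights (1/k) are an illustrative assumption.
    """
    h, w = len(image), len(image[0])
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for d in range(-half, half + 1):
                ii, jj = (i + d, j) if axis == 0 else (i, j + d)
                if 0 <= ii < h and 0 <= jj < w:   # cells outside the grid count as zero
                    total += image[ii][jj]
            out[i][j] = total / k
    return out


# Hypothetical 5 x 9 satellite patch with one elongated object on row 2, cols 1-7.
patch = [[0.0] * 9 for _ in range(5)]
for col in range(1, 8):
    patch[2][col] = 1.0

K = 7
horizontal = strip_conv(patch, K, axis=1)
vertical = strip_conv(patch, K, axis=0)

# Combine both directions into one context map (summation is an assumption).
context = [[hv + vv for hv, vv in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]

strip_params = 2 * K      # one 1 x K kernel plus one K x 1 kernel
square_params = K * K     # a single K x K kernel with the same reach

print(horizontal[2][4], vertical[2][4], strip_params, square_params)
```

At the object's centre the horizontal strip sees the whole elongated object, while the vertical strip sees only a single occupied cell, so the two directions respond very differently to a long, thin shape. The pair of strips reaches as far as a 7 x 7 kernel along each axis with 14 weights instead of 49.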
Abstract
Cross-view object geo-localization enables high-precision object localization through cross-view matching, with critical applications in autonomous driving, urban management, and disaster response. However, existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information, resulting in sensitivity to annotation shifts and limited cross-view matching capability. To address these limitations, we propose a mask-based positional encoding scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes, thereby upgrading the model from "location-aware" to "object-aware." Furthermore, to tackle the challenge of large-span objects (e.g., elongated buildings) in satellite imagery, we design a context enhancement module. This module employs horizontal and vertical strip convolutional…
