Cross-view Geo-localization with Evolving Transformer
Hongji Yang, Xiufan Lu, Yingying Zhu

TL;DR
This paper introduces EgoTR, a Transformer-based model for cross-view geo-localization that leverages self-attention and a novel self-cross attention mechanism to improve global dependency modeling and geometric understanding, outperforming existing CNN-based methods.
Contribution
The paper presents a new evolving geo-localization Transformer with self-cross attention, enhancing global dependency modeling and geometric correspondence in cross-view geo-localization tasks.
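As a rough illustration of what "global dependency modeling" means here, the sketch below implements plain single-head scaled dot-product self-attention over a handful of image-patch tokens. It is not the paper's self-cross attention variant, whose details this summary does not give. The token values, the identity projection matrices, and the function names are all assumptions made for the example, and plain Python lists are used only to keep it dependency-free.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Multiply two matrices stored as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(tokens, w_q, w_k, w_v):
    # Project every token to a query, key, and value vector.
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    # Each token scores itself against ALL tokens, which is the global part.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    # Each output token is a weighted mixture of all value vectors.
    return matmul(weights, v), weights

# Hypothetical input: 4 patch tokens with 2 features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed projections, for illustration only
mixed, weights = self_attention(tokens, identity, identity, identity)
```

Every row of `weights` is a probability distribution over all four patches, so each output token can draw on any region of the image, unlike a convolution with a fixed local receptive field.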
Findings
EgoTR outperforms state-of-the-art methods on multiple datasets.
Self-cross attention improves training stability and generalization.
Transformer-based approach reduces reliance on strong geometric assumptions.
Abstract
In this work, we address the problem of cross-view geo-localization, which estimates the geospatial location of a street view image by matching it with a database of geo-tagged aerial images. The cross-view matching task is extremely challenging due to drastic appearance and geometry differences across views. Unlike existing methods that predominantly fall back on CNNs, here we devise a novel evolving geo-localization Transformer (EgoTR) that utilizes the properties of self-attention in Transformer to model global dependencies, thus significantly decreasing visual ambiguities in cross-view geo-localization. We also exploit the positional encoding of Transformer to help the EgoTR understand and correspond geometric configurations between ground and aerial images. Compared to state-of-the-art methods that impose strong assumptions on geometry knowledge, the EgoTR flexibly learns the…
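The matching step described in the abstract can be pictured as nearest-neighbor retrieval over image embeddings. The sketch below is a minimal illustration under stated assumptions: the embeddings, the coordinates, and the names `cosine_similarity` and `localize` are hypothetical, and in practice the vectors would come from a trained model such as EgoTR rather than being written by hand.

```python
import math

def cosine_similarity(a, b):
    # Assumes both vectors are non-zero and of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def localize(query_embedding, aerial_db):
    # aerial_db: list of (latitude, longitude, embedding) tuples.
    # Return the geo-tag of the aerial image most similar to the query.
    best = max(aerial_db, key=lambda entry: cosine_similarity(query_embedding, entry[2]))
    return best[0], best[1]

# Hypothetical database of geo-tagged aerial embeddings (made-up values).
aerial_db = [
    (10.0, 20.0, [1.0, 0.0]),
    (30.0, 40.0, [0.0, 1.0]),
    (50.0, 60.0, [-1.0, 0.0]),
]

# Hypothetical street-view embedding; it lies closest to the first entry.
street_view_embedding = [0.9, 0.2]
location = localize(street_view_embedding, aerial_db)
```

The estimated location is simply the geo-tag of the best-matching aerial image, so localization accuracy depends on how well the learned embeddings bring matching ground and aerial views together.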
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding · Dropout · Label Smoothing
