Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization
Teng Wang, Shujuan Fan, Daikun Liu, Changyin Sun

TL;DR
This paper introduces TransGCNN, a novel Transformer-guided CNN architecture that improves cross-view geolocalization accuracy by combining local CNN features with global Transformer representations, achieving state-of-the-art results efficiently.
Contribution
The paper proposes a new Transformer-guided CNN model that enhances feature discrimination for ground-to-aerial geolocalization, with a dual-branch Transformer head for multi-scale global feature integration.
Findings
Achieves a top-1 accuracy of 94.12% on the CVUSA dataset.
Outperforms baselines with fewer parameters and a higher frame rate.
Demonstrates superior accuracy-efficiency tradeoff.
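The top-1 figure in the findings above is a retrieval metric: a ground-level query counts as correct when its nearest aerial reference is the true match. The following is a minimal sketch of how such a score could be computed, assuming cosine similarity between descriptors and an index-aligned ground/aerial pairing. The descriptors are toy values and the function names are hypothetical; none of this is the paper's evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top1_accuracy(ground_descriptors, aerial_descriptors):
    """Fraction of ground queries whose best-scoring aerial reference is the true match.

    Assumption: ground_descriptors[i] and aerial_descriptors[i] show the same location.
    """
    hits = 0
    for i, query in enumerate(ground_descriptors):
        sims = [cosine_similarity(query, ref) for ref in aerial_descriptors]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(ground_descriptors)


# Toy descriptors (illustrative only): the last query is deliberately ambiguous.
ground = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
aerial = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [1.0, 0.0, 0.2]]

accuracy = top1_accuracy(ground, aerial)
print(f"top-1 accuracy: {accuracy:.2f}")
```

In this toy set the first three queries retrieve their own reference and the fourth does not, so three of four queries are counted as hits.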
Abstract
Ground-to-aerial geolocalization refers to localizing a ground-level query image by matching it to a reference database of geo-tagged aerial imagery. This is very challenging due to the huge perspective differences in visual appearances and geometric configurations between these two views. In this work, we propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture, which couples CNN-based local features with Transformer-based global representations for enhanced representation learning. Specifically, our TransGCNN consists of a CNN backbone that extracts a feature map from an input image and a Transformer head that models global context from the CNN map. In particular, our Transformer head acts as a spatial-aware importance generator to select salient CNN features as the final feature representation. Such a coupling procedure allows us to leverage a lightweight…
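The idea of a head that scores spatial positions and uses those scores to select CNN features can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the abstract does not specify how importance is computed or applied, so the code substitutes an un-parameterised, attention-style score (mean scaled dot product with all positions) for the Transformer head and a softmax-weighted sum for the selection step. It also does not model the dual-branch, multi-scale design mentioned in the contribution summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_context_scores(features):
    """Hypothetical stand-in for the Transformer head.

    Gives each spatial position one score: its mean scaled dot product with
    every position in the map, so positions that agree with the global
    content score higher. A real head would use learned projections.
    """
    dim = len(features[0])
    scores = []
    for f in features:
        sims = [sum(a * b for a, b in zip(f, g)) / math.sqrt(dim) for g in features]
        scores.append(sum(sims) / len(sims))
    return scores


def importance_weighted_pool(features, scores):
    """Aggregate per-position CNN features into one descriptor using the scores."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[c] for w, f in zip(weights, features)) for c in range(dim)]


# Toy "CNN feature map": a 2x2 grid flattened to 4 positions, 3 channels each.
feature_map = [
    [0.2, 0.1, 0.0],
    [1.0, 0.9, 0.8],
    [0.1, 0.0, 0.2],
    [0.9, 1.0, 0.7],
]

scores = global_context_scores(feature_map)
weights = softmax(scores)
descriptor = importance_weighted_pool(feature_map, scores)
print("importance weights:", [round(w, 3) for w in weights])
print("descriptor:", [round(v, 3) for v in descriptor])
```

The two high-activation positions receive the larger weights, so the pooled descriptor is dominated by them; in a trained model the weighting would come from learned attention rather than raw dot products.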
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Dropout · Label Smoothing · Adam · Residual Connection · Absolute Position Encodings · Byte Pair Encoding
