GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
Manu S Pillai, Mamshad Nayeem Rizve, Mubarak Shah

TL;DR
GAReT is a fully transformer-based approach to cross-view video geolocalization that needs no camera or odometry data. Its GeoAdapter and TransRetriever modules improve efficiency and temporal consistency, and it achieves state-of-the-art results on benchmark datasets.
Contribution
The paper introduces GAReT, a fully transformer-based method for cross-view video geo-localization (CVGL) built around two modules, GeoAdapter and TransRetriever. It removes the need for camera and odometry data and improves the temporal consistency of the predicted trajectories.
Findings
Achieves state-of-the-art performance on benchmark datasets.
Does not require camera or odometry data, reducing reliance on additional sensors.
Improves temporal consistency of GPS trajectories.
Abstract
Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods use camera and odometry data, typically absent in real-world scenarios. They utilize multiple adjacent frames and various encoders for feature extraction, resulting in high computational costs. Moreover, these approaches independently predict each street-view frame's location, resulting in temporally inconsistent GPS trajectories. To address these challenges, in this work, we propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data. We introduce GeoAdapter, a transformer-adapter module designed to efficiently aggregate image-level representations and adapt them for video inputs. Specifically, we train a…
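The abstract notes that predicting each street-view frame's location independently can produce temporally inconsistent GPS trajectories. A minimal sketch of that baseline behaviour follows, assuming each frame and each aerial tile has already been encoded as a vector. This is not GAReT's method: the function names, the toy 2-D embeddings, and the GPS coordinates are all hypothetical and chosen only to make the effect visible.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_per_frame(frame_embs, aerial_embs):
    """For each frame, return the index of the most similar aerial tile.

    Every frame is matched on its own; no temporal context is used.
    """
    return [
        max(range(len(aerial_embs)), key=lambda j: cosine(f, aerial_embs[j]))
        for f in frame_embs
    ]

def trajectory(indices, aerial_gps):
    """Map retrieved tile indices to the GPS tags of those tiles."""
    return [aerial_gps[i] for i in indices]

# Hypothetical toy data: three aerial tiles laid out along a road.
aerial_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aerial_gps = [(28.60, -81.20), (28.61, -81.20), (28.62, -81.20)]

# Three consecutive street-view frames; the third is a noisy embedding.
frame_embs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]

matches = retrieve_per_frame(frame_embs, aerial_embs)
gps_track = trajectory(matches, aerial_gps)
print(matches)    # tile index chosen for each frame
print(gps_track)  # resulting GPS trajectory
```

In this toy run the third frame is matched back to the first tile, so the trajectory jumps backwards along the road. That is the kind of inconsistency that motivates reasoning over the whole sequence rather than frame by frame.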
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Advanced Vision and Imaging
Methods: Greedy Policy Search
