TL;DR
VidTAG introduces a dual-encoder framework with temporal alignment modules for precise, fine-grained global video geolocalization, outperforming existing methods in trajectory consistency and accuracy.
Contribution
The paper presents VidTAG, a novel approach combining self-supervised and language-aligned features with temporal modules for improved global video geolocalization.
Findings
Achieves a 20% improvement over GeoCLIP at the 1 km threshold.
Outperforms the state of the art by 25% on CityGuessr68k.
Generates temporally consistent trajectories across diverse datasets.
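To make the "1 km threshold" concrete, here is a minimal sketch of a distance-threshold accuracy metric. It assumes the threshold refers to the fraction of predictions whose great-circle distance to the ground truth is at most 1 km, which is a common convention in geolocalization but is not spelled out in this summary. The function names `haversine_km` and `accuracy_at_threshold` are hypothetical and not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_threshold(preds, truths, threshold_km=1.0):
    """Fraction of predicted (lat, lon) pairs within threshold_km of the truth.

    Assumed metric definition; the paper may compute it differently.
    """
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return hits / len(preds)
```

Under this definition, a frame predicted about 0.5 km from its true position counts as a hit at the 1 km threshold, while one predicted in the right city but 5 km away does not.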
Abstract
The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale because they need extensive image galleries, which are infeasible to compile. By comparison, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features…
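The frame-to-GPS retrieval idea in the abstract can be illustrated with a minimal sketch: each frame embedding is compared against the embeddings of a gallery of GPS coordinates, and the closest gallery entry becomes that frame's predicted location. This assumes cosine similarity with nearest-neighbor lookup, which the abstract does not confirm, and it takes the embeddings as given rather than modeling the encoders, TempGeo, or GeoRefiner. The names `l2_normalize` and `retrieve_gps` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve_gps(frame_embeddings, gallery_embeddings, gallery_coords):
    """Return one (lat, lon) per frame by nearest-neighbor lookup in the gallery.

    Assumption: frame and GPS embeddings already live in a shared space;
    in the paper they would come from the image and location encoders.
    """
    if len(gallery_embeddings) != len(gallery_coords):
        raise ValueError("gallery embeddings and coordinates must align")
    gallery = [l2_normalize(g) for g in gallery_embeddings]
    trajectory = []
    for frame in frame_embeddings:
        f = l2_normalize(frame)
        # Cosine similarity between this frame and every gallery entry.
        sims = [sum(a * b for a, b in zip(f, g)) for g in gallery]
        best = max(range(len(sims)), key=sims.__getitem__)
        trajectory.append(gallery_coords[best])
    return trajectory
```

Because each frame is matched independently in this sketch, consecutive predictions can jump between distant gallery points. That per-frame inconsistency is the problem the abstract says the temporal modules are meant to address.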
