VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

Parth Parag Kulkarni; Rohit Gupta; Prakash Chandra Chhipa; Mubarak Shah

arXiv:2604.12159·cs.CV·April 15, 2026

TL;DR

VidTAG introduces a dual-encoder framework with temporal alignment modules for precise, fine-grained global video geolocalization, outperforming existing methods in trajectory consistency and accuracy.

Contribution

The paper presents VidTAG, a novel approach combining self-supervised and language-aligned features with temporal modules for improved global video geolocalization.

Findings

01. Achieves a 20% improvement over GeoCLIP at the 1 km threshold.

02. Outperforms the state of the art by 25% on CityGuessr68k.

03. Generates temporally consistent trajectories across diverse datasets.
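
The threshold-based result above can be made concrete with a small sketch. It assumes the metric is the usual fraction of predictions that fall within a given great-circle distance of the ground truth; the summary does not spell out the evaluation protocol, so the function names `haversine_km` and `accuracy_at_km` and the sample coordinates are illustrative only, not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_km(predictions, targets, threshold_km):
    """Fraction of predicted (lat, lon) pairs within threshold_km of their target."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    hits = sum(
        1
        for (p_lat, p_lon), (t_lat, t_lon) in zip(predictions, targets)
        if haversine_km(p_lat, p_lon, t_lat, t_lon) <= threshold_km
    )
    return hits / len(predictions)


# Hypothetical per-frame predictions against a single true location at (0, 0).
sample_predictions = [(0.0, 0.0), (0.0, 0.005), (0.0, 1.0)]
sample_targets = [(0.0, 0.0)] * 3
sample_accuracy = accuracy_at_km(sample_predictions, sample_targets, threshold_km=1.0)
```

Under this reading, a reported gain at the 1 km threshold would be a difference in `accuracy_at_km(..., threshold_km=1.0)` between two methods. Whether the 20% figure is absolute or relative is not stated in this summary.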

Abstract

The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale because they require extensive image galleries that are infeasible to compile. By comparison, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features…
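
As a rough illustration of the frame-to-GPS retrieval idea described in the abstract, the sketch below matches each frame embedding to the most similar entry in a gallery of GPS coordinates by cosine similarity. It is a toy under stated assumptions, not the authors' implementation: the embeddings are hand-written vectors standing in for encoder outputs, and `retrieve_gps`, `retrieve_trajectory`, and the gallery contents are hypothetical names and values introduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def retrieve_gps(frame_embedding, gallery_coords, gallery_embeddings):
    """Return the gallery (lat, lon) whose embedding best matches the frame, with its score."""
    if len(gallery_coords) != len(gallery_embeddings) or not gallery_coords:
        raise ValueError("gallery must be non-empty with one embedding per coordinate")
    scores = [cosine_similarity(frame_embedding, g) for g in gallery_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return gallery_coords[best], scores[best]


def retrieve_trajectory(frame_embeddings, gallery_coords, gallery_embeddings):
    """Predict one coordinate per frame, each frame treated independently."""
    return [retrieve_gps(f, gallery_coords, gallery_embeddings)[0] for f in frame_embeddings]


# Toy gallery: three coordinates with made-up 3-dimensional embeddings.
gallery_coords = [(28.6024, -81.2001), (48.8566, 2.3522), (35.6762, 139.6503)]
gallery_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Two made-up frame embeddings from a hypothetical video.
frame_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
trajectory = retrieve_trajectory(frame_embeddings, gallery_coords, gallery_embeddings)
```

Because each frame is matched on its own, neighbouring frames can land on distant coordinates. That per-frame independence is the kind of temporal inconsistency the abstract says the TempGeo and GeoRefiner modules are meant to address; this sketch does not model either module.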

Code & Models

Repositories

https://parthpk.github.io/vidtag_webpage
