Zero-shot Vision-Language Reranking for Cross-View Geolocalization
Yunus Talha Erzurumlu, John E. Anderson, William J. Shuart, Charles Toth, Alper Yilmaz

TL;DR
This paper explores using zero-shot vision-language models as rerankers in cross-view geolocalization, demonstrating that pairwise comparison strategies improve Top-1 accuracy over baseline retrieval methods.
Contribution
It introduces a two-stage framework with pairwise VLM reranking, showing its effectiveness over pointwise methods in CVGL tasks.
Findings
Pairwise reranking with VLMs improves Top-1 accuracy.
All pointwise methods either cause a catastrophic drop in performance or leave it unchanged.
VLMs are better at relative visual judgment than absolute relevance scoring.
Abstract
Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance or no change at all. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that these VLMs are poorly calibrated for absolute relevance scoring but are…
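The difference between the two reranking strategies can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the `score` and `prefer` callables (stand-ins for prompts sent to a VLM), and the round-robin win-counting scheme are all assumptions made here, since the summary does not specify how pairwise judgments are aggregated.

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")  # a retrieved candidate, e.g. an aerial image ID


def pointwise_rerank(candidates: Sequence[C], score: Callable[[C], float]) -> List[C]:
    """Score each candidate on its own and sort by that absolute score.

    `score` is a hypothetical stand-in for asking a VLM to rate one
    query/candidate pair. Python's sort is stable, so candidates with
    equal scores keep their original retrieval order.
    """
    return sorted(candidates, key=score, reverse=True)


def pairwise_rerank(candidates: Sequence[C], prefer: Callable[[C, C], bool]) -> List[C]:
    """Compare every pair of candidates and sort by number of wins.

    `prefer(a, b)` is a hypothetical stand-in for asking a VLM which of
    two candidates matches the query better; it returns True if `a` wins.
    Round-robin win counting is an assumption of this sketch.
    """
    n = len(candidates)
    wins = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if prefer(candidates[i], candidates[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    # Most wins first; ties fall back to the original retrieval rank.
    order = sorted(range(n), key=lambda i: (-wins[i], i))
    return [candidates[i] for i in order]


# Toy data: retrieval order and a hidden "true" match quality (made up).
retrieved = ["a", "b", "c", "d"]
quality = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.1}

# A poorly calibrated pointwise scorer that gives every candidate the same rating.
flat_order = pointwise_rerank(retrieved, lambda c: 5.0)

# A relative judge that can still tell which of two candidates is better.
pair_order = pairwise_rerank(retrieved, lambda x, y: quality[x] >= quality[y])
```

With the toy scorer, the pointwise pass returns the retrieval order unchanged, while the pairwise pass moves the best candidate to the top. The pairwise pass costs one comparison per pair of candidates, which stays small when only the top few retrieved candidates are reranked.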
