Spatially-Weighted CLIP for Street-View Geo-localization
Ting Han, Fengjiao Li, Chunsong Chen, Haoling Huang, Yiping Chen, Meiliu Wu

TL;DR
SW-CLIP introduces a spatially-aware contrastive learning framework for street-view geo-localization, leveraging geographic relationships to improve accuracy and spatial coherence over traditional CLIP methods.
Contribution
The paper presents SW-CLIP, which incorporates spatial autocorrelation into vision-language contrastive learning using distance-aware supervision and neighborhood regularization.
Findings
SW-CLIP outperforms standard CLIP in geo-localization accuracy.
It reduces long-tail localization errors.
It enhances spatial coherence in the embedding space.
Abstract
This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler's First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances…
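The distance-aware soft supervision described in the abstract can be illustrated with a minimal sketch. The abstract does not state the weighting function, so the exponential decay kernel, the `bandwidth_km` parameter, and every function name below are assumptions chosen for illustration, not the authors' implementation. The sketch computes geodesic (haversine) distances between sample locations, turns them into row-normalized soft targets, and uses those targets in place of one-hot labels in a cross-entropy loss over similarity logits.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def soft_targets(coords, bandwidth_km=1.0):
    """Row-normalized weights exp(-d_ij / bandwidth_km).

    Assumed kernel: nearer samples receive more target mass, so a
    geographically close "negative" is penalized less than a distant one.
    """
    rows = []
    for a in coords:
        weights = [math.exp(-haversine_km(a, b) / bandwidth_km) for b in coords]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def soft_info_nce(logits, targets):
    """Mean cross-entropy between soft targets and the row-wise softmax of logits."""
    loss = 0.0
    for row_logits, row_targets in zip(logits, targets):
        m = max(row_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row_logits))
        loss -= sum(t * (x - log_z) for t, x in zip(row_targets, row_logits))
    return loss / len(logits)


# Hypothetical batch: two nearby points in Manhattan and one in Los Angeles.
coords = [(40.7580, -73.9855), (40.7614, -73.9776), (34.0522, -118.2437)]
targets = soft_targets(coords, bandwidth_km=1.0)

# Hypothetical image-to-location similarity logits for the same batch.
logits = [[5.0, 3.0, -2.0], [3.0, 5.0, -2.0], [-2.0, -2.0, 5.0]]
loss = soft_info_nce(logits, targets)
```

In this sketch the two Manhattan samples share target mass with each other, while the Los Angeles sample keeps an effectively one-hot target. As `bandwidth_km` shrinks toward zero the targets approach one-hot labels and the loss reduces to the standard InfoNCE cross-entropy.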
