AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization
Shixiong Xu, Chenghao Zhang, Lubin Fan, Gaofeng Meng, Shiming Xiang, Jieping Ye

TL;DR
AddressCLIP introduces an end-to-end vision-language framework for city-wide image address localization, leveraging image-text alignment and spatial constraints to improve accuracy over traditional methods.
Contribution
The paper presents AddressCLIP, a novel end-to-end approach for image address localization that combines contrastive learning with spatial manifold constraints, and provides new datasets for this task.
Findings
Outperforms existing transfer learning methods on new datasets
Achieves high accuracy in city-wide address localization
Demonstrates effectiveness through extensive ablations and visualizations
Abstract
In this study, we introduce a new problem raised by social media and photojournalism, named Image Address Localization (IAL), which aims to predict the readable textual address where an image was taken. Existing two-stage approaches involve predicting geographical coordinates and converting them into human-readable addresses, which can lead to ambiguity and be resource-intensive. In contrast, we propose an end-to-end framework named AddressCLIP to solve the problem with more semantics, consisting of two key ingredients: i) image-text alignment to align images with addresses and scene captions by contrastive learning, and ii) image-geography matching to constrain image features with the spatial distance in terms of manifold learning. Additionally, we have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem. Experiments demonstrate…
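The abstract names two training signals but does not give their formulas. The sketch below is one plausible reading, not the authors' implementation: a symmetric contrastive (InfoNCE-style) loss that pulls each image embedding toward its paired address embedding, and a simple pairwise term that encourages image-feature similarity to track spatial proximity. The function names, the temperature value, the planar (x, y) coordinates, and the squared-error form of the geography term are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity of two non-zero vectors (zero vectors are not handled).
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Hypothetical symmetric InfoNCE: the i-th image is the positive for the
    # i-th address text, and every other pairing in the batch is a negative.
    n = len(image_feats)
    logits = [[cosine(img, txt) / temperature for txt in text_feats]
              for img in image_feats]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / n

    columns = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))

def geography_matching_loss(image_feats, coords):
    # Assumed form: mean squared gap between pairwise feature similarity and
    # a proximity score derived from normalised pairwise spatial distance.
    n = len(image_feats)
    dists = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    max_d = max(max(row) for row in dists) or 1.0  # avoid dividing by zero
    total = 0.0
    for i in range(n):
        for j in range(n):
            geo_sim = 1.0 - dists[i][j] / max_d
            total += (cosine(image_feats[i], image_feats[j]) - geo_sim) ** 2
    return total / (n * n)

# Toy batch: two images, their address embeddings, and planar coordinates.
images = [[1.0, 0.0], [0.0, 1.0]]
addresses = [[1.0, 0.0], [0.0, 1.0]]
locations = [(0.0, 0.0), (1.0, 0.0)]
total_loss = contrastive_loss(images, addresses) + geography_matching_loss(images, locations)
```

In this toy batch the correctly paired embeddings give a near-zero contrastive term, while swapping the address embeddings would make it large. How the two terms are weighted against each other is a design choice the abstract does not specify.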
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Indoor and Outdoor Localization Technologies
Methods: ALIGN
