TL;DR
AddressVLM enhances large vision-language models for fine-grained street-level address localization by integrating satellite and street-view images through cross-view alignment, significantly improving accuracy in urban environments.
Contribution
The paper introduces a novel cross-view alignment tuning method and a two-stage training protocol for LVLMs to achieve precise street-level address localization.
Findings
AddressVLM outperforms existing LVLMs by over 9% and 12% in address localization accuracy on the Pittsburgh and San Francisco datasets, respectively.
The authors construct two new street-view VQA datasets, one for Pittsburgh and one for San Francisco.
Cross-view matching is shown to be effective for fine-grained localization.
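To make the accuracy figures above concrete, here is a minimal sketch of how address localization accuracy could be scored, assuming it is the fraction of predicted street names that exactly match the reference after simple normalization. The function names and the sample street names are hypothetical, and the paper's actual metric may be defined differently.

```python
def normalize(address: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(address.lower().split())


def address_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of predictions that exactly match the reference after normalization.

    Exact matching is an assumption made for this sketch, not the paper's stated metric.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


# Hypothetical model outputs and labels, for illustration only.
preds = ["Forbes Avenue", "fifth  avenue", "Penn Avenue", "Liberty Avenue"]
truth = ["Forbes Avenue", "Fifth Avenue", "Smallman Street", "Liberty Avenue"]
accuracy = address_accuracy(preds, truth)
```

Under this definition, a reported gain of "over 9%" would correspond to roughly nine more correct answers per hundred test questions, if the gain is measured in percentage points.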
Abstract
Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that the street-view visual question-and-answer (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning including a satellite-view and street-view image grafting mechanism, along with an automatic label generation mechanism. Then LVLM's global understanding of street distribution is enhanced through…
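The grafting mechanism mentioned in the abstract can be pictured as pasting a street-view image onto a satellite tile, so that a single input carries both the micro and macro cues. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `graft`, the top-right placement, and the 0.5 scale are all hypothetical choices.

```python
import numpy as np


def graft(satellite: np.ndarray, street: np.ndarray, scale: float = 0.5) -> np.ndarray:
    """Paste a downsized street-view image into the top-right corner of a satellite tile.

    Both inputs are H x W x 3 uint8 arrays. The corner position and the scale
    are assumptions for this sketch, not settings taken from the paper.
    """
    sat_h, sat_w = satellite.shape[:2]
    out_h = max(1, int(sat_h * scale))
    out_w = max(1, int(sat_w * scale))
    # Nearest-neighbour resize via index sampling, to avoid extra dependencies.
    rows = np.arange(out_h) * street.shape[0] // out_h
    cols = np.arange(out_w) * street.shape[1] // out_w
    resized = street[rows][:, cols]
    grafted = satellite.copy()  # leave the input tile untouched
    grafted[:out_h, sat_w - out_w:] = resized
    return grafted


satellite_tile = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder satellite image
street_view = np.full((480, 640, 3), 255, dtype=np.uint8)  # placeholder street-view image
combined = graft(satellite_tile, street_view)
```

In a real pipeline the grafted image would be paired with a text label describing where the street view sits on the map. The abstract says such labels are generated automatically but, as excerpted here, does not say how.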
Peer Reviews
Decision·Submitted to ICLR 2025
1. The motivation is clearly stated. 2. The experimental results show the effectiveness of the proposed method. 3. The paper creates new datasets for the research community.
1. The paper uses a pre-trained LVLM to generate textual labels that explain why a street-view image matches a satellite image location. However, any errors or biases in these generated labels become training data for the cross-view alignment tuning stage. These errors could be amplified during the subsequent address localization tuning stage. 2. While improvements are shown, absolute performance is still below that of specialized discriminative models. 3. Evaluation on cities outside the US is limited. 4. …
S1. This paper is very well-written, and the figures are clear. As someone with little background in image address localization, I appreciate the straightforward presentation and review of related work, including Figure 1 which puts the methods of existing work side-by-side with the method in AddressVLM. S2. The ablation study is thorough, reporting results on different model variants during each training stage.
W1. I would have liked to see evaluations on how AddressVLM does on other related geolocalization benchmarks adapted for VQA in the same way that Pitts-VQA and SF-Base-VQA were created. For example, comparing performance on OpenStreetView-5M [1] or Geoguessr data like in [2] would better show how this method specifically improves fine-grained address localization and the side effects it has on other related tasks (e.g., does this method detract from more coarse, global understanding?). I think t
1. It adopts a cross-view alignment fine-tuning strategy by aligning the sparsely collected street-view images with globally consistent satellite images, which enhances the LVLM's understanding of the overall city street distribution. This helps address challenges that cannot be solved using only second-stage address localization fine-tuning. 2. Compared to general LVLMs, this method can achieve a fine-grained understanding of the urban environment using only 4B parameters, providing feasib
1. Although an innovative method for cross-view alignment of street-view images and satellite images was proposed, there is a lack of theoretical analysis and mathematical derivation of this method, which makes it difficult to deeply understand its principles and limitations. 2. The experimental part is only evaluated in a limited urban area, which cannot fully verify the applicability and scalability of this method in a wider urban environment.
