# How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

**Authors:** Juneyoung Ro, Namwoo Kim, Yoonjin Yoon

arXiv: 2508.21565 · 2025-09-01

## TL;DR

This paper evaluates how well current vision-language models understand urban scenes, showing that fine-tuning with a synthetic, domain-specific dataset significantly improves their spatial reasoning abilities in city environments.

## Contribution

It introduces urban spatial reasoning as a new challenge for VLMs and demonstrates the effectiveness of synthetic datasets for domain adaptation.

## Key findings

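- Off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5) perform reasonably well on urban spatial reasoning in zero-shot settings.
- Fine-tuning on the synthetic, urban-specific VQA dataset substantially boosts performance, especially on challenging question types such as negation and counterfactuals.
- Synthetic datasets with Chain-of-Thought supervision enhance model reasoning in urban scenes.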

## Abstract

Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct such a dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
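To make the dataset-construction idea concrete, here is a minimal sketch of how per-image predictions might be turned into question, reasoning, and answer triples. It is an illustration under stated assumptions, not the authors' pipeline: the `Detection` and `VQAItem` types, the question wording, the depth unit, and the two helper functions are all hypothetical, and the reasoning text is filled in from a fixed template, whereas the paper uses LLM-generated CoT answers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected object (hypothetical schema, not taken from the paper)."""
    label: str         # class name from an object detector
    mean_depth: float  # mean predicted depth inside the box; metres assumed


@dataclass
class VQAItem:
    question: str
    reasoning: str  # step-by-step text; templated here, LLM-generated in the paper
    answer: str
    qtype: str


def closer_question(a: Detection, b: Detection) -> VQAItem:
    """Build a relative-depth question from two detections."""
    nearer = a if a.mean_depth < b.mean_depth else b
    reasoning = (
        f"The {a.label} has an estimated depth of {a.mean_depth:.1f} m. "
        f"The {b.label} has an estimated depth of {b.mean_depth:.1f} m. "
        f"A smaller depth means closer to the camera, so the {nearer.label} is closer."
    )
    return VQAItem(
        question=f"Which is closer to the camera, the {a.label} or the {b.label}?",
        reasoning=reasoning,
        answer=nearer.label,
        qtype="relative_depth",
    )


def negation_question(detections: List[Detection], label: str) -> VQAItem:
    """Build a negated existence question ("there is no X") for one class."""
    present = any(d.label == label for d in detections)
    reasoning = (
        f"The detected objects are: {', '.join(d.label for d in detections)}. "
        f"A {label} is {'among' if present else 'not among'} them, "
        f"so the statement 'there is no {label}' is {'false' if present else 'true'}."
    )
    return VQAItem(
        question=f"Is it true that there is no {label} in the scene?",
        reasoning=reasoning,
        answer="no" if present else "yes",
        qtype="negation",
    )


# Example with made-up predictions for a single street-view image.
dets = [Detection("car", 8.2), Detection("traffic light", 21.5)]
items = [closer_question(dets[0], dets[1]), negation_question(dets, "bicycle")]
for item in items:
    print(item.qtype, "|", item.question, "->", item.answer)
```

Because the answers are derived from model predictions rather than human labels, their quality in a setup like this would depend on the accuracy of the upstream segmentation, depth, and detection models.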

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21565/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21565/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/2508.21565/full.md

---
Source: https://tomesphere.com/paper/2508.21565