How Well Do Vision–Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

Juneyoung Ro; Namwoo Kim; Yoonjin Yoon

arXiv:2508.21565 · cs.CV · September 1, 2025

TL;DR

This paper evaluates how well current vision-language models understand urban scenes, showing that fine-tuning with a synthetic, domain-specific dataset significantly improves their spatial reasoning abilities in city environments.

Contribution

It introduces urban spatial reasoning as a new challenge for VLMs and demonstrates the effectiveness of synthetic datasets for domain adaptation.

Findings

01 Fine-tuning improves performance on urban spatial reasoning tasks.

02 VLMs perform reasonably in zero-shot settings but benefit greatly from domain-specific fine-tuning.

03 Synthetic datasets with Chain-of-Thought supervision enhance model reasoning in urban scenes.

Tables

Table 1: Example of extracted metadata for a single street-view image.

Field	Value
Greenery Proportion	0.35
Sky Proportion	0.15
Building Proportion	0.40
Objects	person: 2, car: 5, building: 2
Depth Range	41.5
Closest Object	person
Layout	buildings: left, cars: right
Top Entity	building

Table 2: Performance of vision–language models on perceptual QA and compositional QA tasks. Results are reported for zero-shot and fine-tuned settings. Bold indicates the better performance within each model and metric.

Model	Setup	Proportions: Binary (F1↑)	Proportions: Scalar (MAE↓)	Depth: Categorical (F1↑)	Depth: Closest Obj. (F1↑)	Depth: Binary (F1↑)	Layout: Binary (F1↑)	Layout: Top Entity (F1↑)	Object: Count (MAE↓)	Object: Presence (F1↑)	Object: Co-occ. (F1↑)	Compositional: Negation (F1↑)	Compositional: Counterf. (F1↑)	Compositional: Multihop (F1↑)
LLaVA-1.5	Zero-shot	0.59	0.18	0.62	0.24	0.69	0.61	0.09	2.36	0.95	0.99	0.31	0.35	0.36
LLaVA-1.5	Fine-tuned	0.44	0.22	0.50	0.12	0.80	0.58	0.15	3.03	0.97	0.90	0.72	0.33	0.40
InstructBLIP	Zero-shot	0.47	0.21	0.54	0.22	0.99	0.58	0.03	4.33	0.96	0.99	0.53	0.37	0.80
InstructBLIP	Fine-tuned	0.11	0.27	0.62	0.10	0.88	0.45	0.02	4.05	1.00	0.95	0.40	0.48	0.72
BLIP-2	Zero-shot	0.33	0.21	0.11	0.11	0.99	0.58	0.06	4.10	0.97	0.91	0.52	0.55	0.67
BLIP-2	Fine-tuned	0.89	0.11	0.76	0.67	0.89	0.57	0.87	1.70	0.98	1.00	0.91	0.90	0.81

Table 3: Percentage change from zero-shot to fine-tuned for each model and task type. Positive values indicate improvement (higher F1 or lower MAE), while negative values indicate performance degradation.

Model	Proportions: Binary (% Δ)	Proportions: Scalar (% Δ)	Depth: Categorical (% Δ)	Depth: Closest Obj. (% Δ)	Depth: Binary (% Δ)	Layout: Binary (% Δ)	Layout: Top Entity (% Δ)	Object: Count (% Δ)	Object: Presence (% Δ)	Object: Co-occ. (% Δ)	Compositional: Negation (% Δ)	Compositional: Counterf. (% Δ)	Compositional: Multihop (% Δ)
LLaVA-1.5	–25.4	–22.2	–19.4	–50.0	+15.9	–4.9	+66.7	–28.4	+2.1	–9.1	+132.3	–5.7	+11.1
InstructBLIP	–76.6	–28.6	+14.8	–54.5	–11.1	–22.4	–33.3	+6.5	+4.2	–4.0	–24.5	+29.7	–10.0
BLIP-2	+169.7	+47.6	+590.9	+509.1	–10.1	–1.7	+1350.0	+58.5	+1.0	+9.9	+75.0	+63.6	+20.9
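The sign convention of Table 3 can be checked against Table 2 with a few lines of Python. This is only a sketch: the function name `pct_improvement` and the `higher_is_better` flag are illustrative and not taken from the authors' code, and the inputs are values copied from Table 2.

```python
def pct_improvement(zero_shot: float, fine_tuned: float,
                    higher_is_better: bool = True) -> float:
    """Relative change from zero-shot to fine-tuned, in percent.

    A positive result means improvement: higher F1, or lower MAE.
    """
    delta = fine_tuned - zero_shot
    if not higher_is_better:  # for MAE, a decrease counts as improvement
        delta = -delta
    return round(100.0 * delta / zero_shot, 1)


# Values copied from Table 2.
llava_prop_binary = pct_improvement(0.59, 0.44)                          # F1
llava_prop_scalar = pct_improvement(0.18, 0.22, higher_is_better=False)  # MAE
blip2_top_entity = pct_improvement(0.06, 0.87)                           # F1

print(llava_prop_binary, llava_prop_scalar, blip2_top_entity)
```

The three results match the corresponding Table 3 entries for LLaVA-1.5 (Proportions: Binary and Scalar) and BLIP-2 (Layout: Top Entity).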

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Constraint Satisfaction and Optimization

Full text

How Well Do Vision–Language Models Understand Cities?

A Comparative Study on Spatial Reasoning from Street-View Images

Juneyoung Ro, Namwoo Kim, Yoonjin Yoon*

Korea Advanced Institute of Science and Technology

{juneyoung, namwoo, yoonjin}@spacetime.kaist.ac.kr

Abstract

Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.

*Corresponding author: [email protected]

1 Introduction

Understanding street-level urban scenes at fine-grained spatial scales is crucial for informing how cities are designed and experienced. Visual elements such as greenery, skyline openness, and building density significantly influence urban comfort, walkability, and perceived safety [34, 3, 13, 12]. While humans can intuitively grasp these features when viewing street-level imagery, current AI models—particularly those designed for general-purpose visual understanding—often face difficulties in reasoning about spatial relationships and compositional patterns in complex urban environments.

Out-of-Domain Challenge in Urban Scenes. Large-scale models like CLIP [38] excel on images resembling those in their pretraining corpora, yet frequently underperform on domains such as medical scans, satellite imagery, and “natural adversarial” photos that differ in texture, structure, or composition [27, 18]. A recent survey [27] attributes this limitation to the model’s tendency to rely on surface-level keyword associations rather than reasoning about spatial structures and relationships. Urban street scenes are particularly affected by these issues. While they share recurring elements, such as trees and buildings, fine-grained cues such as how much tall buildings block the sky, the canopy area of trees that creates shaded rest spots, or the overall greenery index that signals biophilic quality all convey crucial spatial information. Accurately interpreting these details is vital for evidence-based planning [19], navigation [17], and public-space design [8].

Street-view imagery serves as a primary tool for capturing and analyzing such fine-grained urban features. It has been extensively employed to study human-inhabited urban environments [33, 26]. Recent research has focused on streetscape perception evaluation [37, 29], and the integration of a reasoning module may further enhance the depth of visual urban analysis. However, existing studies often emphasize perceptual assessment without systematically incorporating reasoning about spatial relationships. To address this gap, our pipeline generates both perceptual and compositional base QA pairs to assess how well state-of-the-art vision–language models interpret urban scenes. We then transform these base pairs into Chain-of-Thought (CoT) variants that verbalize the underlying reasoning process, supporting evaluation of final answers alongside reasoning fidelity. This design allows even simple yes/no questions to probe fine-grained spatial understanding and the ability to articulate metadata-grounded inferences.

In pursuit of this goal, we introduce a synthetic QA generation pipeline built to fine-tune vision–language models (VLMs) for urban street-scene understanding. Leveraging segmentation, depth estimation, and object detection predictions, we construct thousands of visually grounded questions and answers, providing targeted supervision for spatial and compositional understanding. We use this synthetic dataset to evaluate three off-the-shelf VLMs (BLIP-2 [25], InstructBLIP [11], and LLaVA-1.5 [30]), covering the spectrum from zero-shot transfer models to fully conversational multimodal assistants. These models were selected for their architectural diversity, distinct training strategies, and full open-source availability, which enables transparent inspection, reproducibility, and community-driven improvement. BLIP-2 serves as a strong general-purpose model with frozen vision encoders, InstructBLIP as an instruction-tuned variant optimized for multimodal instruction following, and LLaVA-1.5 as a large-scale, conversational VLM with end-to-end fine-tuning. By comparing each model in its original form and after fine-tuning, we assess whether modest adaptation can yield reliable, human-aligned street-level reasoning. Below are our key research questions and contributions.

Research Questions

• RQ1 – How well do current vision-language models understand fine-grained spatial relationships in urban street scenes out of the box?

• RQ2 – How much can targeted fine-tuning on a synthetic but carefully structured domain-specific QA set close this gap?

• RQ3 – How do model strengths and weaknesses vary across different question types, from perception-based to compositional reasoning tasks?

Our Contributions

• We conduct a comparative study of BLIP-2, InstructBLIP, and LLaVA-1.5 on fine-grained spatial reasoning in urban street scenes, enabled by a synthetic VQA dataset we construct from street-view images to support both zero-shot and fine-tuned evaluation.

• We develop a modular QA generation pipeline that produces 280K questions across perceptual, compositional, and CoT formats, enabling diverse and progressively challenging supervision.

• We perform detailed quantitative and qualitative evaluations by question type, revealing model-specific strengths, weaknesses, and the effects and tradeoffs of fine-tuning at different scales.

The code implementing our pipeline is publicly available at: https://github.com/eeyore22/urban_scope.

2 Related Work

Enhancing reasoning in vision–language models. A growing body of work has focused on improving the reasoning capabilities of vision–language models. Gokhale et al. (2020) introduced the VQA-LOL benchmark to specifically evaluate logical reasoning in VQA, demonstrating that models frequently fail on tasks involving negation and complex logic [15]. To address similar challenges, Niu et al. (2021) proposed a counterfactual VQA framework that mitigates language biases, showing that many VQA models rely on spurious correlations rather than true scene understanding [36]. Zhang et al. (2025) further emphasized that even modern multimodal models still misinterpret negated or hypothetical questions in urban driving scenarios, underscoring persistent reasoning gaps [51].

Synthetic and programmatically generated benchmarks have further advanced reasoning evaluation. Johnson et al. (2017) introduced the CLEVR benchmark, which used synthetic 3D scenes to systematically test compositional and logical reasoning in VQA [21]. Hudson and Manning (2019) developed GQA, a large-scale, programmatically generated VQA dataset that specifically targets multi-step reasoning on real images [20]. Chen et al. (2024) introduced SpatialVLM, which augmented VQA training with depth-aware, spatially grounded questions to improve quantitative spatial reasoning [7]. Wang et al. (2025) developed OmniDrive, which synthesized counterfactual driving scenarios (e.g., “If I decide to accelerate and make a left turn, what could be the consequences?”) to generate decision-oriented QA pairs tailored for driving applications [46].

Urban scene understanding and computer vision foundations. Meanwhile, the computer vision community has developed a wide range of tools and foundational datasets for urban scene understanding. Cityscapes [10] and Mapillary Vistas [35] remain among the most widely used resources in this domain. Cityscapes provides finely annotated urban street scenes and significantly advanced semantic segmentation in complex driving environments. Mapillary Vistas extends this foundation by offering diverse street-level imagery across a broader range of cities, countries, and viewpoints, enhancing generalization beyond uniform city structures.

Ranftl et al. (2020) proposed MiDaS, a cross-dataset monocular depth estimation framework that improved depth generalization across various visual domains, including urban scenes [39]. Caesar et al. (2020) introduced nuScenes, which provided multimodal sensory inputs—such as lidar, radar, and multi-camera setups—for urban driving tasks [5], enabling detailed spatial and temporal scene understanding. These perception pipelines now offer high-quality scene metadata that can support reasoning-centric evaluations in vision–language tasks.

3 Methods

3.1 Street-view Image Collection

We collected 50,000 street-view images at a resolution of 512×512 from five global cities (Boston, New York City, Tokyo, Seoul, and Singapore) using the Google Street View API [1]. Sampling locations were systematically selected via OpenStreetMap (OSM) road network coordinates to ensure coverage of publicly accessible streets. At each location, we captured four headings ($0^{\circ}$, $90^{\circ}$, $180^{\circ}$, and $270^{\circ}$) to provide a full panoramic view of the urban context. The dataset spans diverse urban typologies, including downtown cores, residential streets, waterfronts, green corridors, and mixed-use areas, enabling broad generalization across spatial environments.
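As a rough illustration of the four-heading sampling, the sketch below builds one request URL per heading for a single coordinate. It assumes the Street View Static API endpoint and its `size`, `location`, `heading`, and `key` parameters; the paper only says "Google Street View API", so the exact endpoint, the helper name `streetview_urls`, and the sample coordinate are assumptions. No network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint: Street View Static API (the paper does not name the endpoint).
BASE_URL = "https://maps.googleapis.com/maps/api/streetview"
HEADINGS = (0, 90, 180, 270)  # the four headings described above, in degrees


def streetview_urls(lat: float, lon: float, api_key: str, size: int = 512) -> list[str]:
    """Return one image-request URL per heading for a single sampling location."""
    urls = []
    for heading in HEADINGS:
        params = {
            "size": f"{size}x{size}",     # 512x512, as in the text
            "location": f"{lat},{lon}",   # e.g. a point taken from the OSM road network
            "heading": heading,
            "key": api_key,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # Hypothetical coordinate and placeholder key, for illustration only.
    for url in streetview_urls(42.3601, -71.0589, "YOUR_API_KEY"):
        print(url)
```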

3.2 Scene Attribute Extraction

For each street-view image $I$, we extract scene attributes using pretrained models: semantic segmentation, object detection, and monocular depth estimation. From these, we assemble structured metadata for each image, such as greenery proportion, object counts, and depth range. This metadata serves as pseudo-ground truth, offering scalable, interpretable supervision for spatial reasoning tasks.

Semantic Segmentation. We use SegFormer [49] pretrained on the Cityscapes dataset [10] to obtain pixel-wise class labels and calculate the proportions of greenery, sky, and buildings, often quantified as Green View Index (GVI), Sky View Factor (SVF), and Building View Factor (BVF) [16, 33, 26]. These metrics help characterize visual structure in urban scenes, enabling tasks like detecting dominant vertical arrangements and left–right spatial bias.
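A minimal sketch of how pixel proportions could be derived from a predicted label map is shown below. It assumes the map uses the 19-class Cityscapes train-ID convention (building = 2, vegetation = 8, sky = 10) and that "greenery" means the vegetation class only; both are assumptions, and the function name `view_proportions` is illustrative.

```python
from collections import Counter

# Assumed Cityscapes train IDs; the authors' exact class mapping is not given here.
BUILDING, VEGETATION, SKY = 2, 8, 10


def view_proportions(label_map: list[list[int]]) -> dict[str, float]:
    """Fraction of pixels labelled vegetation, sky, and building."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {
        "greenery": counts[VEGETATION] / total,
        "sky": counts[SKY] / total,
        "building": counts[BUILDING] / total,
    }


# Toy 2x4 label map: two sky, two building, three vegetation, one road pixel.
toy_map = [
    [10, 10, 2, 2],
    [8, 8, 8, 0],
]
print(view_proportions(toy_map))
```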

Object Detection. DETR ResNet-50 [6] is used to detect key urban scene elements (e.g., pedestrians, cars, bicycles). We extract both counts and bounding box locations, enabling questions about object presence, quantity, and co-occurrence.
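The counts and co-occurrence facts described here can be derived from a list of detections as sketched below. The detection dictionary layout and the 0.9 confidence cut-off are hypothetical choices made for illustration; the paper does not specify them in this section.

```python
from collections import Counter
from itertools import combinations


def summarize_detections(detections: list[dict], min_score: float = 0.9):
    """Count detected classes and list class pairs that appear together.

    `detections` is a list of {"label": str, "score": float, "box": (x0, y0, x1, y1)}.
    `min_score` is an assumed confidence threshold, not a value from the paper.
    """
    kept = [d for d in detections if d["score"] >= min_score]
    counts = Counter(d["label"] for d in kept)
    # Every unordered pair of distinct classes present in the scene co-occurs.
    co_occurring = set(combinations(sorted(counts), 2))
    return counts, co_occurring


toy_detections = [
    {"label": "person", "score": 0.98, "box": (10, 40, 30, 90)},
    {"label": "person", "score": 0.95, "box": (50, 42, 70, 95)},
    {"label": "car", "score": 0.99, "box": (100, 60, 180, 110)},
    {"label": "bicycle", "score": 0.40, "box": (200, 70, 230, 100)},  # dropped by threshold
]
counts, pairs = summarize_detections(toy_detections)
print(dict(counts), pairs)
```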

Monocular Depth Estimation. Using MiDaS [39], we generate per-pixel depth maps and derive scene-level statistics such as depth range, variance, and object-wise depth averages. These inform reasoning about spatial complexity and object proximity.
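The sketch below shows one way to turn a depth map and object boxes into the "depth range" and "closest object" attributes. MiDaS predicts relative inverse depth, so larger values are closer to the camera; averaging over the bounding box and keeping one box per label are simplifying assumptions, and `depth_summary` is an illustrative name.

```python
from statistics import mean


def depth_summary(inv_depth: list[list[float]], boxes: dict[str, tuple[int, int, int, int]]) -> dict:
    """Scene-level depth range and the object with the nearest average depth.

    inv_depth: 2D map of relative inverse depth (larger = closer, as in MiDaS).
    boxes: {label: (x0, y0, x1, y1)} in pixels, with x1 and y1 exclusive.
    """
    flat = [v for row in inv_depth for v in row]
    per_object = {
        label: mean(inv_depth[y][x] for y in range(y0, y1) for x in range(x0, x1))
        for label, (x0, y0, x1, y1) in boxes.items()
    }
    closest = max(per_object, key=per_object.get)  # highest inverse depth = nearest
    return {"depth_range": max(flat) - min(flat), "closest_object": closest}


toy_depth = [
    [1, 1, 2, 2],
    [1, 1, 9, 9],
    [5, 5, 9, 9],
]
toy_boxes = {"person": (2, 1, 4, 3), "car": (0, 2, 2, 3)}
print(depth_summary(toy_depth, toy_boxes))
```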

Metadata Assembly. The predictions are consolidated into a unified, structured metadata record for each image. This record directly drives base QA pair generation, ensuring that every question is traceable to specific visual evidence in the scene.
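One possible shape for the unified metadata record is sketched below, with field names mirroring Table 1 and values copied from it. The class name `SceneMetadata` and the exact field types are assumptions; the authors' actual schema may differ.

```python
from dataclasses import asdict, dataclass


@dataclass
class SceneMetadata:
    """Per-image record; fields follow Table 1, types are assumed."""
    greenery_proportion: float
    sky_proportion: float
    building_proportion: float
    objects: dict[str, int]   # class name -> count
    depth_range: float
    closest_object: str
    layout: dict[str, str]    # entity -> side of the image
    top_entity: str


# Values taken from the example in Table 1.
example = SceneMetadata(
    greenery_proportion=0.35,
    sky_proportion=0.15,
    building_proportion=0.40,
    objects={"person": 2, "car": 5, "building": 2},
    depth_range=41.5,
    closest_object="person",
    layout={"buildings": "left", "cars": "right"},
    top_entity="building",
)
print(asdict(example))
```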

Following prior benchmarks such as SpatialVLM [7] and OmniDrive [46], we adopt a synthetic supervision strategy based on pretrained model outputs rather than human annotations. This approach carries the risk of occasional errors such as reduced pixel-level precision, but offers a practical tradeoff: enabling large-scale, consistent QA generation with broad task coverage and reproducibility, thereby supporting more transparent and controlled evaluations of spatial reasoning in vision–language models.

3.3 QA Generation Pipeline

We generate a QA dataset in two main phases: (1) creation of base QA pairs with short factual answers, and (2) transformation of the answers into Chain-of-Thought (CoT) variants that verbalize the underlying reasoning process. Base QA generation covers two broad perspectives: a perceptual perspective, which can be answered directly from a single scene attribute, and a compositional perspective, which requires combining multiple attributes through intermediate logic. Appendix 6.1 details the number of QA pairs per question type, with templates and type-specific rules in Appendices 6.2–6.3.

3.3.1 Base QA Generation

The first phase produces base QA pairs with short, factual answers. All questions are grounded in structured metadata of segmentation, object detection, and depth estimation predictions, and fall into two broad categories:

Perceptual QA.

These questions can be answered directly from a single scene attribute, using deterministic rules and interpretable thresholds. Answers are typically numeric or binary, reflecting raw scene measurements.

• Proportions: Scalar or binary assessments of the pixel-wise proportions of greenery, sky, and buildings.

• Depth: Identification of the closest object using depth maps.

• Layout: Inference of vertical composition and left/right dominance.

• Objects: Counts, presence, and co-occurrence of urban scene elements.

Perceptual thresholds are drawn from prior studies linking feature proportions (e.g., $\text{GVI}>30\%$) to visual dominance and aesthetic appraisal [2, 4, 52, 32, 42]. These perception-level outputs serve as structured evidence for the reasoning and CoT stages, while also enabling standalone evaluation of intuitive visual patterns such as object prominence and spatial asymmetry.
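To make the rule-based generation concrete, the sketch below derives a few perceptual QA pairs from a metadata record. Only the 30% greenery threshold comes from the text; the question wordings and the function name `perceptual_qa` are hypothetical stand-ins for the templates in the appendices.

```python
GVI_THRESHOLD = 0.30  # from the "GVI > 30%" example; other thresholds are not stated here


def perceptual_qa(meta: dict) -> list[tuple[str, str]]:
    """Generate short-answer QA pairs, each from a single scene attribute."""
    greenery = meta["greenery_proportion"]
    objects = meta["objects"]
    return [
        # Proportions (binary and scalar)
        ("Is greenery visually dominant in this scene?",
         "yes" if greenery > GVI_THRESHOLD else "no"),
        ("What proportion of the image is greenery?", f"{greenery:.2f}"),
        # Objects (count and presence)
        ("How many cars are visible?", str(objects.get("car", 0))),
        ("Is there a person in the scene?",
         "yes" if objects.get("person", 0) > 0 else "no"),
        # Depth (closest object)
        ("Which object is closest to the camera?", meta["closest_object"]),
    ]


# Metadata values from Table 1.
table1_meta = {
    "greenery_proportion": 0.35,
    "objects": {"person": 2, "car": 5, "building": 2},
    "closest_object": "person",
}
for question, answer in perceptual_qa(table1_meta):
    print(f"Q: {question}  A: {answer}")
```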

Compositional QA.

These questions require integrating multiple perceptual facts and applying intermediate logic to produce a higher-order answer. The output remains short, such as yes/no, integer, or a single word, but the derivation follows a fixed, question type–specific reasoning rule.

• Negation: Tests the ability to process exclusions or counter-statements based on perceptual evidence.

• Counterfactuals: Hypothetical scenarios constructed from plausible alternatives to the observed scene attribute.

• Multi-hop: Multi-step comparisons and chained logic that traverse multiple perceptual attributes.

Negation and counterfactuals pose well-known challenges to both humans and AI models [14, 24, 22]. Prior work has shown that LLMs often misinterpret these forms [28, 45], underscoring the need for rigorous evaluation of compositional reasoning grounded in explicit perceptual inputs.
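The three compositional types can be illustrated with fixed rules over the same metadata. Everything in this sketch is hypothetical: the 50% dominance threshold, the question wordings, and the specific rules are stand-ins for the type-specific rules in the appendices, chosen only to show how several perceptual facts combine into one short answer.

```python
DOMINANCE = 0.50  # hypothetical "majority of pixels" threshold


def negation_qa(meta: dict) -> tuple[str, str]:
    """Negation: the statement is true only if no named category dominates."""
    neither = meta["building_proportion"] <= DOMINANCE and meta["sky_proportion"] <= DOMINANCE
    return ("Is it true that neither buildings nor sky dominates this scene?",
            "yes" if neither else "no")


def counterfactual_qa(meta: dict, removed_cars: int) -> tuple[str, str]:
    """Counterfactual: alter one attribute, then re-evaluate a comparison."""
    remaining = max(meta["objects"].get("car", 0) - removed_cars, 0)
    still_more = remaining > meta["objects"].get("person", 0)
    return (f"If {removed_cars} cars were removed, would cars still outnumber people?",
            "yes" if still_more else "no")


def multihop_qa(meta: dict) -> tuple[str, str]:
    """Multi-hop: (1) find the largest region, (2) check it among detected objects."""
    regions = {
        "greenery": meta["greenery_proportion"],
        "sky": meta["sky_proportion"],
        "building": meta["building_proportion"],
    }
    largest = max(regions, key=regions.get)
    detected = meta["objects"].get(largest, 0) > 0
    return ("Is the category covering the most pixels also detected as an object?",
            "yes" if detected else "no")


# Metadata values from Table 1.
table1_meta = {
    "greenery_proportion": 0.35,
    "sky_proportion": 0.15,
    "building_proportion": 0.40,
    "objects": {"person": 2, "car": 5, "building": 2},
}
print(negation_qa(table1_meta))
print(counterfactual_qa(table1_meta, removed_cars=4))
print(multihop_qa(table1_meta))
```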

3.3.2 Chain-of-Thought QA Transformation

In the second phase, each base QA pair is transformed into a CoT variant by replacing its short factual answer with a step-by-step natural language rationale that reconstructs the original reasoning process. “Thinking-aloud” reasoning has been shown to enhance performance and interpretability in tasks such as arithmetic, commonsense inference, and multi-hop reasoning [48, 23, 47], and has recently been adapted to vision–language settings [50, 40, 9]. In our framework, CoT is not an additional question type but an answer expansion layer applied to both perceptual and compositional QA.

For each QA pair, Gemini 1.5 Flash [44] is prompted with the question, the scene’s metadata (Table 1), and the corresponding reasoning rule from a predefined question type–to–reasoning rule mapping table (shown in Appendix 6.4). The model is instructed to treat the metadata as visual evidence, follow the specified rule, and produce a step-by-step natural language rationale before stating the final answer, ensuring that the explanation remains fully grounded in the scene attributes extracted in Section 3.2. Figure 3 shows an example QA, and detailed prompt templates are provided in Appendix 6.4.
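The prompt assembly described above might look like the sketch below, which combines the question, the metadata, and a rule looked up by question type. The rule text, the prompt wording, and the names `REASONING_RULES` and `build_cot_prompt` are all hypothetical; the actual templates are in Appendix 6.4, and the call to the LLM itself is deliberately left out.

```python
import json

# Hypothetical excerpt of a question type -> reasoning rule mapping.
REASONING_RULES = {
    "negation": "Check each named category against its dominance threshold, "
                "then decide whether the negated statement holds.",
}


def build_cot_prompt(question: str, metadata: dict, question_type: str) -> str:
    """Assemble a prompt asking for a step-by-step rationale and a final answer."""
    rule = REASONING_RULES[question_type]
    return "\n".join([
        "Treat the following metadata as visual evidence from the image.",
        f"Metadata: {json.dumps(metadata, sort_keys=True)}",
        f"Reasoning rule: {rule}",
        f"Question: {question}",
        "Explain your reasoning step by step, then state the final answer.",
    ])


prompt = build_cot_prompt(
    "Is it true that neither buildings nor sky dominates this scene?",
    {"building_proportion": 0.40, "sky_proportion": 0.15},
    "negation",
)
print(prompt)
```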

3.3.3 Human Validation of Synthetic Supervision

Verifying the quality of generated supervision is essential in synthetic benchmarks, as it impacts the reliability and interpretability of downstream evaluations [41, 31, 43]. Because our QA pipeline uses pretrained models both to extract structured metadata and to produce CoT reasoning traces, we conducted a human evaluation of 500 randomly sampled QA pairs spanning all question types to check for (1) metadata accuracy and (2) CoT answer consistency and plausibility. Judgments were binary, focusing on perceptual plausibility rather than exact numeric precision (e.g., confirming that “3 cars” corresponds to roughly three visible cars, or that a greenery proportion of 0.47 appears visually reasonable).

Key results indicate that predictions from segmentation and depth estimation exhibited high accuracy (95% and 94%, respectively), whereas object detection showed comparatively more errors (88%), primarily due to over-counting objects not visible to human annotators. CoT reasoning was largely consistent with predefined rules (98% for consistency); however, in certain cases, the direct transformation of rule-based logic did not yield a fully plausible description of the scene’s complexity (90% for plausibility). Appendix 6.5 provides the full results with representative examples of both correct and incorrect cases.

4 Evaluation

4.1 Experimental Settings

We evaluate each question in both zero-shot and fine-tuned settings across BLIP-2 (Flan-T5-xl), InstructBLIP (Flan-T5-xl), and LLaVA-1.5-7B to isolate the effect of task-specific adaptation.

• Zero-shot: Evaluation on the synthetic VQA dataset without any fine-tuning, measuring each model’s inherent ability to perform spatial reasoning in urban scenes.

• Fine-tuned: Evaluation after fine-tuning on the synthetic VQA dataset, measuring how each model’s spatial reasoning performance changes with targeted supervision.

All models are fine-tuned using a batch size of 32, for 40 epochs, with a learning rate of 1e-4 and the AdamW optimizer. The dataset is split into training, validation, and test sets with a 7:2:1 ratio, and we apply the same splits and hyperparameters across all models for fair comparison. Details regarding the metrics, answer parsing logic, and prompt constraints are in Appendix 6.6.
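A reproducible 7:2:1 split and the stated hyperparameters can be written down as in the sketch below. The fixed seed, the function name `split_dataset`, and the choice to split a flat list of items are assumptions; whether the authors split by image or by QA pair is not stated in this section.

```python
import random

# Hyperparameters as stated in the text; the dictionary layout is illustrative.
FINETUNE_CONFIG = {
    "batch_size": 32,
    "epochs": 40,
    "learning_rate": 1e-4,
    "optimizer": "AdamW",
}


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle with a fixed seed and split into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # same seed -> same split for every model
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Reusing the same seed for every model keeps the splits identical across models, as the text requires.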

4.2 Quantitative Evaluation

4.2.1 Heterogeneous Performance Trends

When we compare zero-shot performance to CoT fine-tuning across all task types, three distinct patterns emerge. The significant-improvement group includes tasks such as counterfactual reasoning, negation reasoning, and depth-categorical questions. For example, as shown in Table 3, BLIP-2 shows a remarkable 509% gain in depth-closest object and a 591% improvement in depth-categorical questions. Negation and counterfactual reasoning tasks also see substantial improvements, with BLIP-2 achieving a 75% increase in negation and a 64% increase in counterfactual reasoning. These gains suggest that a few thousand in-domain examples can rapidly equip models to address reasoning gaps in urban scenes. The marginal-improvement group, most notably object presence and multihop reasoning, shows more modest gains, with BLIP-2 improving by around 1% in object presence and 21% in multihop reasoning. Similarly, LLaVA-1.5 demonstrates a slight 11% improvement in multihop tasks and minimal gains in object presence. This suggests that fine-tuning aids logical chaining but may not fully overcome the inherent challenges in these complex, compositional tasks.

The performance-degradation group includes tasks such as object co-occurrence, layout binary, and proportion binary, where models frequently show drops after fine-tuning. For instance, Table 3 shows that LLaVA-1.5 and InstructBLIP drop by 9% and 4% on object co-occurrence and by 25% and 77% on proportion binary, respectively, suggesting that adaptation to an in-domain dataset can sometimes erode zero-shot strengths in simpler perceptual tasks. One plausible explanation is catastrophic forgetting, where fine-tuning on a dataset dominated by compositional and reasoning-heavy samples shifts the model’s feature representations away from low-level perceptual cues. A second factor could be data distribution mismatch, as our fine-tuning set contains relatively fewer straightforward perception cases. Our fine-tuning also emphasizes rule-based reasoning traces, which could draw attention toward abstract scene logic at the expense of rapid, single-step perceptual judgments. These findings suggest pairing domain-specific adaptation with strategies such as rehearsal data, multi-task balancing, or selective freezing to preserve perceptual competence while boosting reasoning performance.

4.2.2 Model Efficiency and Practical Trade-offs

As shown in Figure 5, BLIP-2 offers the most parameter-efficient gains, achieving substantial perception and reasoning improvements despite being the smallest model in this comparison. In contrast, LLaVA-1.5 specializes in reasoning tasks, showing strong reasoning gains with minimal parameter overhead but a noticeable decline in perception tasks, such as a 25% drop in proportion binary and a 22% increase in proportion scalar MAE (Table 3). Notably, LLaVA-1.5’s robust zero-shot performance makes it an appealing option for deployment in settings with minimal or no task-specific supervision. InstructBLIP, despite its larger model size, demonstrates limited parameter efficiency and suffers from perceptual performance degradation across several tasks, including a 77% decrease in proportion binary and a 29% increase in scalar MAE, suggesting that more targeted fine-tuning strategies may be required to fully leverage its potential. Taken together, these results emphasize the need for careful model selection based on the target domain and computational constraints. BLIP-2 emerges as an effective, lightweight model for domain-specific fine-tuning pipelines like ours, providing strong perception and reasoning gains with low computational cost, while LLaVA-1.5 remains a valuable option in reasoning-focused applications, especially when fine-tuning budgets are limited and zero-shot robustness is critical.

4.3 Qualitative Evaluation

Beyond numerical performance, our qualitative evaluation highlights that different models exhibit distinct strengths across question types. Notably, BLIP-2 consistently demonstrates versatility and robust reasoning capabilities across a wide range of tasks. For example, as illustrated in Figure 6, in a representative negation-type question, fine-tuning enhances reasoning quality for both InstructBLIP and BLIP-2 compared to their zero-shot counterparts. However, while InstructBLIP tends to remain at the level of question repetition—stating that “the statement says that neither buildings nor sky dominates this scene”—BLIP-2 engages in more explicit logical reasoning, accurately resolving the double-negative structure of the question by deducing: “while neither the sky nor the building takes up the majority of the image, they are both present and visible. Therefore, it’s false to say that neither dominates the scene.”

While LLaVA-1.5 exhibits comparatively lower average performance, as indicated by its F1 scores in Table 2, it nonetheless produces occasional high-quality answers. We report one example of a depth task in Figure 6. The model consistently provides detailed and contextually appropriate assessments of scene depth, even when other models struggle. Additionally, InstructBLIP shows notable proficiency in counterfactual reasoning. Fine-tuning substantially improves its ability to perform multi-step hypothetical reasoning beyond surface-level descriptions, as shown in the counterfactual example in the same figure.

Collectively, these qualitative observations complement the quantitative results, indicating that while fine-tuning with a domain-specific synthetic QA dataset generally improves model performance, each model retains distinct strengths more apparent through qualitative analysis.

5 Conclusions and Future Work

In this study, we evaluated the ability of general-purpose vision-language models to understand fine-grained spatial relationships in street-view images. By introducing a structured pipeline for generating diverse, spatially grounded QA tasks, our work establishes a new problem domain in VL research and creates opportunities for advancing domain-specific perception and reasoning.

Comprehensive experimental results show that fine-tuning with our synthetic QA dataset leads to substantial performance gains. Lightweight models like BLIP-2 particularly benefit from this structured supervision, achieving gains in perception and reasoning capabilities with minimal task-specific data. These findings highlight the potential of synthetic QA with CoT supervision as a versatile approach for enhancing spatial understanding in urban scenes and other domain-specific contexts. For larger models like LLaVA-1.5, our study suggests that synthetic QA alone may be insufficient to shift their pretrained distribution. Addressing this will likely require more complex QA structures and advanced instruction tuning methods, presenting a key direction for future research.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2022M3J6A1063021, No.RS-2025-00517342).

Bibliography

[1] Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh, Stéphane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh Weaver. Google Street View: Capturing the world at street level. Computer, 43(6):32–38, 2010.
[2] Yoji Aoki. Evaluation methods for landscapes with greenery. Landscape Research, 16:3–6, 1991.
[3] Mondira Bardhan, Fu Li, Mathew H.E.M. Browning, Jiaying Dong, Kuiran Zhang, Shuai Yuan, Hüseyin Ertan İnan, Olivia Mc Anirlin, Dani T. Dagan, Allison Maynard, Katie Thurson, Fan Zhang, Ruoyu Wang, and Marco Helbich. From space to street: A systematic review of the associations between visible greenery and bluespace in street view imagery and mental health. Environmental Research, 2024.
[4] Anna-Maria Bolte, Benjamin Niedermann, Thomas Kistemann, Jan-Henrik Haunert, Youness Dehbi, and Theo Kötter. The green window view index: automated multi-source visibility analysis for a multi-scale assessment of green window views. Landscape Ecology, 39(3):71, 2024.
[5] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. 2020.
[7] Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024.
[8] Mingze Chen, Yuxuan Cai, Shuying Guo, Ruilin Sun, Yang Song, and Xiwei Shen. Evaluating implied urban nature vitality in San Francisco: An interdisciplinary approach combining census data, street view images, and social media analysis. Urban Forestry & Urban Greening, 95:128289, 2024.