Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
HaoTian Lan

TL;DR
This paper presents a multimodal framework combining vision and language models to assess urban streetscapes, capturing subjective perceptions and providing interpretable diagnostics for inclusive urban planning.
Contribution
It introduces a novel multimodal assessment framework that fuses vision transformers with large language models for interpretable urban perception analysis.
Findings
Achieved an F1 score of 0.84 on objective street features
Reached 89.3% agreement with aggregated resident perceptions
Captured context-dependent perceptual contradictions
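The two headline numbers above are an F1 score and an agreement rate. As a rough illustration of how such figures are typically computed, the sketch below scores binary labels and compares model labels against a majority vote over resident ratings. The function names, the binary setting, and the majority-vote aggregation are assumptions made for illustration; this summary does not say how the paper averages F1 across features or aggregates resident responses.

```python
from collections import Counter


def f1_score(y_true, y_pred):
    """Binary F1 (positive class = 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_label(ratings):
    """Assumed aggregation rule: the most frequent resident rating wins."""
    return Counter(ratings).most_common(1)[0][0]


def agreement_rate(model_labels, resident_ratings):
    """Share of images where the model label equals the aggregated rating."""
    matches = sum(
        1
        for label, ratings in zip(model_labels, resident_ratings)
        if label == majority_label(ratings)
    )
    return matches / len(model_labels)


# Toy data, not taken from the paper.
toy_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
toy_agreement = agreement_rate(
    ["safe", "unsafe"],
    [["safe", "safe", "unsafe"], ["safe", "safe", "unsafe"]],
)
```

With several objective features, one common choice is to compute F1 per feature and average the results, but whether the reported 0.84 is a macro or micro average is not stated here.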
Abstract
While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy…
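The abstract names LoRA as one of the parameter-efficient adaptation methods. For readers unfamiliar with it, the sketch below shows the core idea in plain Python: the pretrained weight matrix W stays frozen, and only a low-rank update B·A, scaled by alpha / r, is trained. The helper names and the toy dimensions are hypothetical; the rank, scaling factor, and target layers used in the paper are not given in this summary, and P-Tuning v2 is not covered here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w: frozen pretrained weights, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Parameters trained by LoRA for one matrix, versus d_out * d_in for full tuning."""
    return r * (d_in + d_out)


# Toy example with rank 1; values are illustrative only.
w = [[1, 0], [0, 1]]
a = [[1, 2]]
b_init = [[0], [0]]  # B is conventionally initialised to zero, so training starts at W
b_trained = [[1], [2]]

start = lora_effective_weight(w, a, b_init, alpha=2)
adapted = lora_effective_weight(w, a, b_trained, alpha=2)
```

For a square 4096 by 4096 matrix, a rank of 8 would train 65,536 parameters instead of roughly 16.8 million, which is the sense in which the adaptation is parameter-efficient.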
Taxonomy
Topics: Urban Design and Spatial Analysis · Urban Transport and Accessibility · Automated Road and Building Extraction
Methods: Dense Connections · Layer Normalization · Vision Transformer
