Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

HaoTian Lan

arXiv:2506.05087 · cs.CV · June 6, 2025

TL;DR

This paper presents a multimodal framework combining vision and language models to assess urban streetscapes, capturing subjective perceptions and providing interpretable diagnostics for inclusive urban planning.

Contribution

The paper introduces the Multimodal Street Evaluation Framework (MSEF), which fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4) for interpretable urban perception analysis.

Findings

01 F1 score of 0.84 on objective street features

02 89.3% agreement with aggregated resident perceptions

03 Captures context-dependent perceptual contradictions

Abstract

While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy…
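
The two headline metrics in the abstract can be illustrated with a minimal sketch. The paper's exact evaluation protocol is not given here, so the code below assumes binary labels for a single objective feature and a simple majority vote to aggregate resident responses. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_label(votes):
    """Aggregate one image's resident votes by majority (assumed scheme)."""
    return Counter(votes).most_common(1)[0][0]


def percent_agreement(model_labels, resident_votes):
    """Share of images where the model label matches the aggregated vote."""
    aggregated = [majority_label(v) for v in resident_votes]
    matches = sum(1 for m, a in zip(model_labels, aggregated) if m == a)
    return 100.0 * matches / len(model_labels)


# Hypothetical toy data: 1 = feature present (e.g. a sidewalk), 0 = absent.
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]
toy_f1 = f1_score(toy_true, toy_pred)

# Hypothetical perception labels: one model label and three votes per image.
toy_model = ["vibrant", "calm", "vibrant", "calm"]
toy_votes = [
    ["vibrant", "vibrant", "calm"],
    ["calm", "calm", "calm"],
    ["calm", "calm", "vibrant"],
    ["calm", "vibrant", "calm"],
]
toy_agreement = percent_agreement(toy_model, toy_votes)

print(f"F1: {toy_f1:.2f}, agreement: {toy_agreement:.1f}%")
```

With multiple objective features, the per-feature F1 values would need to be averaged (macro or micro); the abstract does not say which averaging the reported 0.84 uses.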

Taxonomy

Topics: Urban Design and Spatial Analysis · Urban Transport and Accessibility · Automated Road and Building Extraction

Methods: Dense Connections · Layer Normalization · Vision Transformer