WalkCLIP: Multimodal Learning for Urban Walkability Prediction
Shilong Xiang, JangHyeon Lee, Min Namgung, Yao-Yi Chiang

TL;DR
WalkCLIP is a multimodal framework that combines satellite, street view, and population data to accurately predict urban walkability, addressing limitations of single-source assessments.
Contribution
This paper introduces WalkCLIP, a novel multimodal approach integrating visual and behavioral data for improved urban walkability prediction.
Findings
Outperforms unimodal and multimodal baselines in accuracy
Effective integration of visual and behavioral signals
Reliable predictions across diverse urban locations
Abstract
Urban walkability is a cornerstone of public health, sustainability, and quality of life. Traditional walkability assessments rely on surveys and field audits, which are costly and difficult to scale. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability, but these single-source approaches capture only one dimension of the walking environment. Satellite data describe the built environment from above, but overlook the pedestrian perspective. Street view imagery captures conditions at the ground level, but lacks broader spatial context. Population dynamics reveal patterns of human activity but not the visual form of the environment. We introduce WalkCLIP, a multimodal framework that integrates these complementary viewpoints to predict urban walkability. WalkCLIP learns walkability-aware vision-language representations from GPT-4o…
Taxonomy
Topics: Urban Transport and Accessibility · Human Mobility and Location-Based Analysis · Urban Green Space and Health
