CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing
Tianhui Liu, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Jie Feng, Yong Li, Pan Hui

TL;DR
CityLens is a comprehensive benchmark that evaluates large vision-language models' ability to predict urban socioeconomic indicators from satellite and street view images across 17 cities, revealing their strengths and limitations.
Contribution
This work introduces CityLens, the largest socioeconomic benchmark for LVLMs with diverse tasks, datasets, and geographic coverage, enabling systematic evaluation and diagnosis of model capabilities.
Findings
LVLMs show promising perceptual and reasoning skills.
Models still face limitations in accurately predicting socioeconomic indicators.
CityLens offers a unified framework for model evaluation and improvement.
Abstract
Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of Large Vision-Language Models (LVLMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering 17 globally distributed cities and spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and use 3 evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. These make CityLens the most extensive socioeconomic…
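The abstract names Feature-Based Regression but does not describe it here. As a rough illustration only, the sketch below assumes one plausible setup: a model emits a few numeric visual-feature scores per region, and an ordinary least-squares regressor maps those scores to an indicator. The feature names (`greenery`, `building_density`), the scores, and the targets are all invented for the example and are not taken from CityLens.

```python
# Hypothetical sketch of a feature-based regression step.
# All feature names and numbers below are made up for illustration.

def fit_ols(X, y):
    """Fit y ~ b0 + b1*x1 + ... by solving the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coef, x):
    """Apply the fitted intercept and weights to one feature vector."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


# Assumed per-region scores: (greenery, building_density)
features = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 5.0)]
# Assumed ground-truth indicator, constructed as 1 + 2*greenery + 3*density
indicator = [3.0, 4.0, 6.0, 8.0, 22.0]

coef = fit_ols(features, indicator)
estimate = predict(coef, (2.0, 2.0))
```

Under this assumption, the vision-language model only has to produce relative feature scores, while the regressor handles calibration to the indicator's scale.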
Peer Reviews
Decision: ICLR 2026 Poster
1. CityLens introduces the most extensive benchmark to date for evaluating LVLMs on urban socioeconomic prediction, with unprecedented coverage across 17 globally distributed cities, 11 diverse tasks, and 6 critical socioeconomic domains. This scale and diversity significantly advance beyond prior urban vision or geospatial AI benchmarks. 2. Beyond reporting performance numbers, the paper offers thoughtful discussion on why LVLMs struggle—such as lack of numerical grounding, sensitivity to visu
Please refer to questions.
1. This paper proposes CityLens, a large-scale and multi-domain benchmark designed to evaluate the performance of LVLMs on urban socioeconomic indicator prediction tasks. It represents a pioneering attempt in this research area. 2. The dataset covers 17 cities, 11 socioeconomic indicators, and 17 LVLMs, with a large experimental scale and carefully designed data collection, mapping, and preprocessing pipelines. The paper introduces three distinct evaluation paradigms—Direct Metric Prediction, No
1. The results in the Direct Metric Prediction and Normalized Estimation sections (with most tasks showing R² below 0.2 or even negative) clearly demonstrate the severe limitations of current LVLMs in numerical prediction tasks. From a methodological perspective, relying on prompting to have LVLMs directly output numerical values may be inherently unstable. Such a mechanism is fundamentally ill-suited for precise regression tasks. 2. The current experiments primarily focus on comparing multiple
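For readers less familiar with the metric cited in this review, the following minimal sketch computes R² from its standard definition, 1 minus the ratio of residual to total sum of squares. The numbers are invented and are not CityLens results. It shows why R² can be negative: predictions that rank regions perfectly but sit on the wrong numeric scale score far worse than simply predicting the mean.

```python
# Illustrative only: the values below are made up, not benchmark results.

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Predicting the mean for every region gives exactly zero.
r2_mean = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])

# Perfectly ordered predictions on a 10x scale give a large negative value.
r2_off_scale = r_squared(y_true, [10.0, 20.0, 30.0, 40.0])
```

Here `r2_mean` is 0.0 and `r2_off_scale` is -485.0, so a low or negative R² can reflect miscalibrated magnitudes as well as poor perception.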
1. The curated CityLens dataset is a major contribution to the community. 2. The benchmarking results are comprehensive and have a large coverage of state-of-the-art LVLMs. 3. The use of three evaluation paradigms (Direct, Normalized, Feature-Based) is technically sound. 4. The paper is well organized and easy to follow.
1. The evaluation is primarily zero-shot or few-shot. A natural question is how much these models could improve if fine-tuned on the CityLens dataset. While the authors mention this as a future direction, a preliminary fine-tuning experiment on one or two models would have strengthened the paper by establishing a potential performance upper bound. 2. While the paper diagnoses what models struggle with (e.g., mental health, life expectancy), a more detailed qualitative analysis of why could be be
Taxonomy
Topics: Human Mobility and Location-Based Analysis
