CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing

Tianhui Liu; Hetian Pang; Xin Zhang; Tianjian Ouyang; Zhiyuan Zhang; Jie Feng; Yong Li; Pan Hui

arXiv:2506.00530·cs.AI·March 3, 2026

TL;DR

CityLens is a comprehensive benchmark that evaluates large vision-language models' ability to predict urban socioeconomic indicators from satellite and street view images across 17 cities, revealing their strengths and limitations.

Contribution

This work introduces CityLens, the largest socioeconomic benchmark for LVLMs with diverse tasks, datasets, and geographic coverage, enabling systematic evaluation and diagnosis of model capabilities.

Findings

01

LVLMs show promising perceptual and reasoning skills.

02

Models still face limitations in accurately predicting socioeconomic indicators.

03

CityLens offers a unified framework for model evaluation and improvement.

Abstract

Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of Large Vision-Language Models (LVLMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize 3 evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. These make CityLens the most extensive socioeconomic…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. CityLens introduces the most extensive benchmark to date for evaluating LVLMs on urban socioeconomic prediction, with unprecedented coverage across 17 globally distributed cities, 11 diverse tasks, and 6 critical socioeconomic domains. This scale and diversity significantly advance beyond prior urban vision or geospatial AI benchmarks.
2. Beyond reporting performance numbers, the paper offers thoughtful discussion on why LVLMs struggle, such as lack of numerical grounding, sensitivity to visu

Weaknesses

Please refer to questions.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper proposes CityLens, a large-scale and multi-domain benchmark designed to evaluate the performance of LVLMs on urban socioeconomic indicator prediction tasks. It represents a pioneering attempt in this research area.
2. The dataset covers 17 cities, 11 socioeconomic indicators, and 17 LVLMs, with a large experimental scale and carefully designed data collection, mapping, and preprocessing pipelines. The paper introduces three distinct evaluation paradigms: Direct Metric Prediction, No

Weaknesses

1. The results in the Direct Metric Prediction and Normalized Estimation sections (with most tasks showing R² below 0.2 or even negative) clearly demonstrate the severe limitations of current LVLMs in numerical prediction tasks. From a methodological perspective, relying on prompting to have LVLMs directly output numerical values may be inherently unstable. Such a mechanism is fundamentally ill-suited for precise regression tasks.
2. The current experiments primarily focus on comparing multiple
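For readers unfamiliar with how R² can fall below zero, the following is a minimal sketch, not taken from the paper. It uses the standard definition R² = 1 - SS_res / SS_tot; the function name `r_squared` and all numeric values are hypothetical and chosen only for illustration. A model that always predicts the mean of the ground truth scores exactly 0, so any prediction set with a larger squared error than that baseline scores negative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical indicator values for four regions (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5, 2.5, 2.5, 2.5]   # always predict the mean
reversed_guess = [4.0, 3.0, 2.0, 1.0]  # ranks regions in the wrong order

print(r_squared(y_true, y_true))          # perfect predictions
print(r_squared(y_true, mean_baseline))   # mean baseline
print(r_squared(y_true, reversed_guess))  # worse than the mean baseline
```

Under this definition, an R² below 0.2 means the predictions explain less than a fifth of the variance in the indicator, and a negative value means they are worse than a constant guess.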

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The curated CityLens dataset is a major contribution to the community.
2. The benchmarking results are comprehensive and have a large coverage of state-of-the-art LVLMs.
3. The use of three evaluation paradigms (Direct, Normalized, Feature-Based) is technically sound.
4. The paper is well organized and easy to follow.

Weaknesses

1. The evaluation is primarily zero-shot or few-shot. A natural question is how much these models could improve if fine-tuned on the CityLens dataset. While the authors mention this as a future direction, a preliminary fine-tuning experiment on one or two models would have strengthened the paper by establishing a potential performance upper bound.
2. While the paper diagnoses what models struggle with (e.g., mental health, life expectancy), a more detailed qualitative analysis of why could be be

Taxonomy

Topics: Human Mobility and Location-Based Analysis