CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

Tianhui Liu; Hetian Pang; Xin Zhang; Jie Feng; Yong Li; Pan Hui

arXiv:2510.22282 · cs.CV · October 28, 2025

TL;DR

CityRiSE leverages reinforcement learning to enhance vision-language models for accurate, interpretable, and generalizable urban socio-economic status prediction using multi-modal visual data.

Contribution

This paper introduces CityRiSE, a reinforcement learning framework that guides large vision-language models (LVLMs) to focus on meaningful visual cues for socio-economic prediction, improving accuracy and interpretability.

Findings

01 Outperforms existing baselines in accuracy and generalization

02 Enables reasoning on unseen cities and indicators

03 Demonstrates the effectiveness of RL in guiding LVLMs

Abstract

Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful…
