Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li

TL;DR
This paper introduces SocioSeg, a new dataset, and SocioReasoner, a vision-language reasoning framework, for segmenting socially defined urban entities in satellite imagery. The framework surpasses existing models and generalizes zero-shot.
Contribution
The work presents a novel dataset and a reasoning framework that incorporates cross-modal recognition and reinforcement learning for socio-semantic segmentation in satellite images.
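Segmentation quality is commonly scored with intersection-over-union (IoU) between a predicted mask and a ground-truth mask, and a score of this kind is one plausible reward signal for reinforcement learning on a segmentation task. This summary does not state how the paper defines its reward or metrics, so the following is only a minimal sketch under assumptions: the function name `mask_iou`, the nested-list mask format, and the convention that two empty masks score 1.0 are all hypothetical choices made for illustration.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two binary masks.

    Both masks are equally sized 2D lists of 0/1 values (an assumed format,
    chosen only to keep the sketch dependency-free).
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


# Toy example: 2 overlapping pixels, 4 pixels covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 0.5
```

A score in [0, 1] like this can be logged per entity category, which is the kind of breakdown that makes it visible where a model gains or loses.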
Findings
Outperforms state-of-the-art models in socio-semantic segmentation
Demonstrates strong zero-shot generalization capabilities
Provides open-source dataset and code for further research
Abstract
As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic…
Peer Reviews
Decision · ICLR 2026 Conditional Poster
1. The SocioSeg dataset uses a hierarchical structure from Socio-name to Socio-function and unifies diverse geospatial data into a single map, enabling easier multi-modal reasoning. 2. The SocioReasoner framework simulates human-like annotation with sequential localization and refinement and integrates vision-language models with SAM under reinforcement learning.
1. The main contribution of this paper lies in the introduction of a new dataset, with the primary performance improvements stemming from the dataset and reinforcement learning. Overall, the innovation appears to be limited. 2. The evaluation lacks per-category breakdown and qualitative error analysis so it is unclear which entities are actually improved. 3. The paper shows quantitative improvements but lacks qualitative failure analysis or discussion of where the model fails. 4. Stronger ab
1. Novel dataset. The paper introduces SocioSeg, a new dataset for urban socio-semantic segmentation, focusing on identifying socially defined categories (e.g., schools, parks, hospitals) from satellite imagery. 2. Methodological proposal. The proposed SocioReasoner framework combines vision-language reasoning and reinforcement learning to tackle the proposed semantic segmentation task. The approach is conceptually interesting and aligns with current research trends in multimodal and reasoning-
1. **Lack of justification for the new task.** While the paper motivates the challenge of detecting socially meaningful categories, it does not convincingly articulate the concrete real-world relevance or practical need for socio-semantic segmentation. The connection to downstream applications (e.g., urban analytics, policy planning, or social impact studies) could be made more explicit. Moreover, it is not clear how these socio-semantic classes are defined, and they seem quite arbitrary. Is there
- **(S1)**: this paper focuses on an interesting interaction of vision-language/reasoning and remote sensing/earth observation. - **(S2)**: this paper is easy to read and follow, given the systematic build-up of the paper. - **(S3)**: I appreciate the authors' contribution in curating and annotating the earth observation and remote sensing dataset.
- **(W1)**: Methodological novelty of the presented approach: this work combines several components of previously published work to address a new task for a specific domain (being remote sensing and earth observation). However, I see the introduced dataset being a contribution, which doesn't change the methodological novelty of the presented approach. - **(W2)**: Mismatch and relevance to the ICLR community: I think this work might be more suitable for publication at a computer vision conferen
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
