Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision

Yimei Zhang; Guojiang Shen; Kaili Ning; Tongwei Ren; Xuebo Qiu; Mengmeng Wang; Xiangjie Kong

arXiv:2511.07062 · cs.AI · December 2, 2025

TL;DR

UrbanLN is a pre-training framework that improves urban region representation learning by aligning long captions with fine-grained visual features and by filtering noise in LLM-generated captions using multi-model collaboration and self-distillation.

Contribution

The paper introduces UrbanLN, a pre-training framework that improves the alignment of long textual descriptions with visual features and mitigates noise in LLM-generated captions for urban imagery.

Findings

01 UrbanLN outperforms existing methods on multiple downstream tasks.

02 The proposed noise suppression strategies improve robustness when captions are noisy.

03 Extensive experiments validate its effectiveness across four real-world cities.

Abstract

Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Mobility and Location-Based Analysis · Domain Adaptation and Few-Shot Learning