Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision
Yimei Zhang, Guojiang Shen, Kaili Ning, Tongwei Ren, Xuebo Qiu, Mengmeng Wang, Xiangjie Kong

TL;DR
UrbanLN is a pre-training framework that improves urban region representation learning by aligning long captions with visual features and by filtering noise in the captions through multi-model collaboration and self-distillation.
Contribution
The paper introduces UrbanLN, a new method that improves alignment of long textual descriptions with visual data and mitigates noise in caption generation for urban imagery.
Findings
UrbanLN outperforms existing methods on multiple downstream tasks.
The proposed noise suppression strategies improve robustness in noisy caption scenarios.
Extensive experiments across four real-world cities validate the framework's effectiveness.
Abstract
Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Mobility and Location-Based Analysis · Domain Adaptation and Few-Shot Learning
