UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science

Jie Zhang; Xingtong Yu; Yuan Fang; Rudi Stouffs; Zdravko Trivic

arXiv:2602.08342·cs.CV·February 10, 2026

TL;DR

This paper introduces a dataset, a training strategy, and a benchmark for learning and evaluating spatially grounded multimodal embeddings of urban environments, with reported gains on image retrieval and geolocation ranking.

Contribution

It presents the UGData dataset, the UGE training method, and the UGBench benchmark, which together advance urban multimodal embeddings by explicitly aligning street-view images with spatial structure.

Findings

01. Up to 44% improvement in image retrieval.
02. Over 30% gains in geolocation ranking.
03. Effective spatial grounding for urban tasks.

Abstract

Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. Finally, we introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban…
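The abstract mentions contrastive learning as one ingredient of UGE but does not give the loss in this summary. As a rough illustration of the general idea only, the sketch below implements a generic symmetric InfoNCE loss between paired image and text embeddings in plain Python. It is not the authors' objective: the function names, the temperature value of 0.07, and the toy vectors are all assumptions made for this example, and the instruction-guided and graph-based components described in the abstract are not modeled here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(a_rows, b_rows):
    """Entry [i][j] is the cosine similarity between a_rows[i] and b_rows[j]."""
    a_unit = [l2_normalize(row) for row in a_rows]
    b_unit = [l2_normalize(row) for row in b_rows]
    return [[sum(x * y for x, y in zip(a, b)) for b in b_unit] for a in a_unit]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE: image-to-text and text-to-image, averaged.

    Hypothetical helper for illustration; the temperature is an assumed value.
    """
    sims = cosine_similarity_matrix(image_emb, text_emb)
    logits = [[s / temperature for s in row] for row in sims]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


# Toy 2-D embeddings (made up): pair i of images matches pair i of texts.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9]]
texts_shuffled = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = symmetric_info_nce(images, texts_aligned)
shuffled_loss = symmetric_info_nce(images, texts_shuffled)
print(aligned_loss < shuffled_loss)
```

With these toy inputs the correctly paired batch yields a lower loss than the shuffled one, which is the behavior a contrastive objective is meant to reward.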

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Human Mobility and Location-Based Analysis