UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science
Jie Zhang, Xingtong Yu, Yuan Fang, Rudi Stouffs, Zdravko Trivic

TL;DR
This paper introduces a new dataset, training strategy, and benchmark for learning and evaluating spatially grounded multimodal embeddings in urban environments, improving urban understanding tasks.
Contribution
It presents UGData, UGE training method, and UGBench benchmark, advancing spatially grounded urban multimodal embeddings with explicit spatial alignment.
Findings
Up to 44% improvement in image retrieval.
Over 30% gains in geolocation ranking.
Effective spatial grounding for urban tasks.
Abstract
Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We finally introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Graph Neural Networks · Human Mobility and Location-Based Analysis
