UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Jie Feng; Shengyuan Wang; Tianhui Liu; Yanxin Xi; Yong Li

arXiv:2506.23219 · cs.CV · July 1, 2025

Open Access

TL;DR

UrbanLLaVA is a multi-modal large language model tailored for urban intelligence, capable of understanding and reasoning across diverse urban data types, outperforming existing models in various urban tasks.

Contribution

The paper introduces UrbanLLaVA, a novel multi-modal LLM with a specialized urban instruction dataset and a multi-stage training framework for enhanced spatial reasoning and urban task performance.

Findings

01. Outperforms existing MLLMs in urban tasks

02. Demonstrates strong generalization across cities

03. Effectively handles diverse multi-modal urban data

Abstract

Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Mobility and Location-Based Analysis · Advanced Neural Network Applications