UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li

TL;DR
UrbanLLaVA is a multi-modal large language model tailored for urban intelligence, capable of understanding and reasoning across diverse urban data types, outperforming existing models in various urban tasks.
Contribution
The paper introduces UrbanLLaVA, a novel multi-modal LLM with a specialized urban instruction dataset and a multi-stage training framework for enhanced spatial reasoning and urban task performance.
Findings
Outperforms existing MLLMs in urban tasks
Demonstrates strong generalization across cities
Effectively handles diverse multi-modal urban data
Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Mobility and Location-Based Analysis · Advanced Neural Network Applications
