UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Baichuan Zhou, Haote Yang, Dairong Chen, Junyan Ye, Tianyi Bai, Jinhua Yu, Songyang Zhang, Dahua Lin, Conghui He, Weijia Li

TL;DR
UrBench is a new comprehensive benchmark for evaluating large multimodal models in complex multi-view urban scenarios, revealing current models' limitations in urban understanding tasks and cross-view relations.
Contribution
The paper introduces UrBench, a large-scale, multi-view urban benchmark with diverse tasks and data from 11 cities, enabling thorough evaluation of LMMs in urban environments.
Findings
Current LMMs perform significantly worse than humans in urban tasks.
Even the best models lag behind humans by 17.4% on average.
Models show inconsistent behavior across different urban views.
Abstract
Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a…
Taxonomy
Topics: Geographic Information Systems Studies · Human Mobility and Location-Based Analysis
