CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li

TL;DR
CityBench is a comprehensive evaluation platform for assessing large language and vision-language models on diverse urban tasks, highlighting their strengths in understanding urban semantics and their limitations in professional and numerical tasks.
Contribution
This paper introduces CityBench, the first systematic benchmark for evaluating LLMs and VLMs in urban research, integrating urban data and simulation for diverse task assessment.
Findings
LLMs perform well on semantic and commonsense urban tasks.
LLMs struggle with geospatial prediction and traffic control.
Advanced models show competitive performance in urban understanding.
Abstract
As large language models (LLMs) continue to advance and gain widespread use, establishing systematic and reliable evaluation methodologies for LLMs and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. There have been some early explorations of the usability of LLMs for limited urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data, the complexity of application scenarios, and the highly dynamic nature of the urban environment. In this paper, we design CityBench, an interactive simulator-based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs for diverse tasks in urban research. First, we build CityData to integrate the…
Peer Reviews
Decision: Submitted to ICLR 2025
**(S1)**: A comprehensive benchmark with multiple interactive and non-interactive tasks. The authors provide a benchmark with a good diversity of tasks for evaluating VLM and LLM capabilities. The tasks that form this benchmark are valuable to the research community and to future LLM developers. **(S2)**: Extensive evaluations with widely used large language models. The experiments break down the performance on each task for multiple LLMs with sensible metrics. (although I think Llama 405
**(W1)**: Template-based questions. The authors specify in Section 3.3.1 that they use LLMs to generate instructions for tasks and also use LLMs to filter out low-quality data. More details are needed here, such as which LLMs are used, whether a mixture of LLMs is used, and how much this biases the downstream benchmark performance in favor of certain models. In general, more specific details on the quality control process are required (e.g., how much low-quality data was generated, how much was m
1. Open Source: The authors have developed CityBench and CitySim, which involve a substantial amount of engineering effort, and have also made the codebase open-source. This is beneficial for future research in this domain. 2. Rich geographic diversity: CityBench covers 13 cities, providing a rich geographic diversity.
1. Figure presentation issues: The left panel of Figure 5 aims to showcase differences in performance across cities, but lacks a legend to identify different cities. 2. Experiment: 1) Incomplete and lacking insight in Error Analysis: The error analysis section only provides insights into LLM errors, with no analysis of VLM errors. Additionally, the LLM error analysis merely highlights general issues common across benchmarks, such as instruction-following limitations and hallucinations, without o
1. It’s exciting to see the promising potential of methods based on interactive simulators in the field of LLMs for urban applications; this work can provide a meaningful boost to community development. 2. The workload is substantial, covering a wide range of tasks with detailed and thorough experiments.
1. The paper provides a somewhat vague description of the pipeline. For example, in Section 3.3.1, what are the criteria for identifying low-quality data? What are the standards for filtering and rewriting? If data quality remains unsatisfactory after filtering and rewriting, is it filtered again or further processed in the same manner? In Section 3.1 CityData, the authors state that OSM data is unsuitable for city data construction and introduce a globally applicable rule-based map con
Taxonomy
Topics: Natural Language Processing Techniques
