USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents

Siqi Lai; Yansong Ning; Zirui Yuan; Zhixi Chen; Hao Liu

arXiv:2505.17572 · cs.AI · May 26, 2025

TL;DR

USTBench is a benchmark that evaluates the spatiotemporal reasoning of large language models acting as urban agents across four dimensions (spatiotemporal understanding, forecasting, planning, and reflection with feedback), revealing their strengths and limitations.

Contribution

This paper introduces USTBench, the first benchmark for detailed process-level and task-level evaluation of LLMs' urban spatiotemporal reasoning capabilities.

Findings

01. LLMs show potential in urban decision-making tasks.

02. Long-horizon planning and reflection remain challenging for LLMs.

03. Advanced reasoning models do not always outperform non-reasoning models.

Abstract

Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite this potential, existing studies primarily evaluate urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five diverse urban decision-making and four spatiotemporal prediction tasks,…
