Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges

Pengrui Quan; Brian Wang; Kang Yang; Liying Han; Mani Srivastava

arXiv:2505.11618 · cs.AI · January 13, 2026

TL;DR

This paper introduces STARK, a benchmark for evaluating the spatiotemporal reasoning capabilities of LLMs and LRMs across diverse tasks, revealing strengths and limitations in geometric and world-knowledge reasoning.

Contribution

The paper presents a comprehensive hierarchical benchmark, STARK, for systematically assessing and comparing LLMs and LRMs in complex spatiotemporal reasoning tasks.

Findings

01. LLMs show limited success on geometric reasoning tasks, and their performance degrades as task complexity increases.

02. LRMs perform robustly, often surpassing traditional methods.

03. The performance gap between LLMs and LRMs narrows in world-knowledge reasoning, where some LLMs outperform LRMs.

Abstract

Spatiotemporal reasoning plays a key role in Cyber-Physical Systems (CPS). Despite advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs), their capacity to reason about complex spatiotemporal signals remains underexplored. This paper proposes a hierarchical SpatioTemporal reAsoning benchmaRK, STARK, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation). We curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges where models answer directly or by Python Code Interpreter.…
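The three-level hierarchy described in the abstract can be pictured as a simple grouping of results by tier. The following is a minimal sketch under stated assumptions, not the benchmark's actual evaluation code: the `Tier` enum, the `accuracy_by_tier` helper, the `(tier, is_correct)` record layout, and the toy results are all hypothetical, and plain accuracy is used only for illustration (the metrics STARK actually reports are not stated in this listing).

```python
from collections import defaultdict
from enum import Enum


class Tier(Enum):
    """Hypothetical labels for the three reasoning levels named in the abstract."""
    STATE_ESTIMATION = 1
    REASONING_OVER_STATES = 2
    WORLD_KNOWLEDGE = 3


def accuracy_by_tier(results):
    """Return {Tier: fraction correct} from an iterable of (Tier, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in results:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: correct[tier] / totals[tier] for tier in totals}


# Toy, made-up outcomes purely to exercise the helper (not benchmark data).
toy_results = [
    (Tier.STATE_ESTIMATION, True),
    (Tier.STATE_ESTIMATION, False),
    (Tier.REASONING_OVER_STATES, True),
    (Tier.WORLD_KNOWLEDGE, True),
]
scores = accuracy_by_tier(toy_results)
for tier, score in scores.items():
    print(f"{tier.name}: {score:.2f}")
```

Reporting per tier rather than as a single pooled score keeps the comparison between LLMs and LRMs visible at each level of reasoning complexity.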

Code & Models

Repositories

nesl/stark_benchmark
Official

Datasets

prquan/STARK_10k
dataset · 1.7k downloads

Videos

Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges · SlidesLive

Taxonomy

Topics: Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications · Human Mobility and Location-Based Analysis