STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, Kun Zhan, Sen Su, Chunxiao Liu, Ning Miao

TL;DR
STT-Arena is a new benchmark with 227 realistic interactive tasks designed to evaluate and improve large language models' ability to adapt to spatio-temporal disruptions in dynamic environments.
Contribution
The paper introduces STT-Arena, a comprehensive benchmark for spatio-temporal dynamic reasoning, and proposes an approach based on iterative refinement and reinforcement learning (RL) to improve LLM performance.
Findings
State-of-the-art models achieve less than 40% accuracy on STT-Arena.
Three common failure modes are identified: Stale-State Execution, Misdiagnosis, and Missing Verification.
The proposed STT-Agent-4B outperforms existing models on the benchmark.
Abstract
Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even the SOTA proprietary…
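The mechanism described in the abstract (an executable environment with an injected trigger that invalidates an ongoing plan, so the agent must detect the state shift and replan) can be illustrated with a minimal toy sketch. Everything here is hypothetical and chosen for illustration only: `ToyEnv`, `make_plan`, `run_episode`, the site names, and the `trigger_at` parameter are not part of STT-Arena or its API, and the actual benchmark tasks are presumably far richer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical environment whose state shifts partway through a task."""
    open_sites: set = field(default_factory=lambda: {"north", "south"})
    trigger_at: int = 1  # assumed: the trigger fires after this many tool calls
    calls: int = 0
    visited: list = field(default_factory=list)

    def observe(self) -> set:
        """Return a copy of the currently open sites."""
        return set(self.open_sites)

    def visit(self, site: str) -> bool:
        """Tool call: succeeds only if the site is currently open."""
        ok = site in self.open_sites
        if ok:
            self.visited.append(site)
        self.calls += 1
        if self.calls == self.trigger_at:
            # Injected trigger: "south" closes and "east" opens instead.
            self.open_sites = {"north", "east"}
        return ok


def make_plan(open_sites: set, needed: int, done: list) -> list:
    """Pick open, unvisited sites (alphabetical) to reach `needed` visits."""
    candidates = sorted(s for s in open_sites if s not in done)
    return candidates[: needed - len(done)]


def run_episode(env: ToyEnv, needed: int = 2) -> dict:
    """Execute a plan, re-checking the state before every step."""
    plan = make_plan(env.observe(), needed, env.visited)
    replans = 0
    while len(env.visited) < needed and plan:
        step = plan[0]
        if step not in env.observe():
            # The earlier plan is stale: rebuild it from the current state.
            plan = make_plan(env.observe(), needed, env.visited)
            replans += 1
            continue
        env.visit(plan.pop(0))
    return {"visited": list(env.visited), "replans": replans}


result = run_episode(ToyEnv())
print(result)
```

In this sketch the initial plan is `["north", "south"]`. The trigger fires after the first tool call, so the pre-step check finds that `"south"` is no longer open and the plan is rebuilt as `["east"]`. An agent that skipped the check would call `visit("south")` against a stale state and fail.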
