STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Tingfeng Hui; Hao Xu; Pengyu Zhu; Hongsheng Xin; Kun Zhan; Sen Su; Chunxiao Liu; Ning Miao

arXiv:2605.18548 · cs.CL · May 19, 2026

TL;DR

STT-Arena is a new benchmark with 227 realistic interactive tasks designed to evaluate and improve large language models' ability to adapt to spatio-temporal disruptions in dynamic environments.

Contribution

The paper introduces STT-Arena, a comprehensive benchmark for spatio-temporal dynamic reasoning, and proposes an iterative refinement and RL approach to enhance LLM performance.

Findings

01 State-of-the-art models achieve less than 40% accuracy on STT-Arena.

02 The authors identify three common failure modes: Stale-State Execution, Misdiagnosis, and Missing Verification.

03 The proposed STT-Agent-4B outperforms existing models on the benchmark.
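
To make the failure modes above concrete, here is a minimal toy sketch in Python. It is not the STT-Arena environment or its API: the classes `Trigger` and `ToyEnv`, the helper `make_env`, and both agent functions are hypothetical names invented for illustration. The sketch assumes a single injected trigger that closes a location after a fixed number of tool calls. It contrasts an agent that acts on a stale observation and never checks the result with one that reads the tool result and replans.

```python
from dataclasses import dataclass, field


@dataclass
class Trigger:
    """Hypothetical disruption: a location closes once enough tool calls have run."""
    fire_at_step: int
    closed_location: str


@dataclass
class ToyEnv:
    """Hypothetical environment in which every tool call advances time by one step."""
    open_locations: set
    trigger: Trigger
    step_count: int = 0
    log: list = field(default_factory=list)

    def _tick(self):
        # Advance the clock and apply the trigger once its step is reached.
        self.step_count += 1
        if self.step_count >= self.trigger.fire_at_step:
            self.open_locations.discard(self.trigger.closed_location)

    def check_open(self, location):
        self._tick()
        return location in self.open_locations

    def book(self, location):
        self._tick()
        ok = location in self.open_locations
        self.log.append((location, ok))  # record what actually happened
        return ok


def make_env():
    # The museum closes at the second tool call.
    return ToyEnv({"museum", "gallery"}, Trigger(fire_at_step=2, closed_location="museum"))


def stale_agent(env, preferred, fallback):
    """Acts on an earlier observation and ignores the booking result."""
    if env.check_open(preferred):   # step 1: preferred is observed open
        env.check_open(fallback)    # step 2: the trigger fires during this call
        env.book(preferred)         # step 3: books using the stale observation
        return preferred            # reports success without verifying
    env.book(fallback)
    return fallback


def replanning_agent(env, preferred, fallback):
    """Reads each tool result and switches to the fallback when the plan breaks."""
    env.check_open(preferred)
    env.check_open(fallback)
    if env.book(preferred):
        return preferred
    # The booking failed, so the state has shifted: revise the plan.
    if env.book(fallback):
        return fallback
    return None


if __name__ == "__main__":
    env = make_env()
    print(stale_agent(env, "museum", "gallery"), env.log)
    env = make_env()
    print(replanning_agent(env, "museum", "gallery"), env.log)
```

In this toy setting the first agent reports a booking that the environment log shows as failed, while the second ends with a confirmed booking at the fallback location. The benchmark's actual tasks, triggers, and scoring are defined by the paper and may differ substantially.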

Abstract

Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even the SOTA proprietary…

Code & Models

Datasets

Miaow-Lab/STT-Arena
dataset · 90 dl
