TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks
Muyan Weng, Defu Cao, Wei Yang, Yashaswi Sharma, Yan Liu

TL;DR
TemporalBench is a comprehensive benchmark designed to evaluate the true temporal reasoning capabilities of models across multiple domains, revealing that high forecasting accuracy does not necessarily imply robust contextual or event-aware understanding.
Contribution
We introduce TemporalBench, a multi-domain benchmark with a four-tier taxonomy to diagnose and analyze temporal reasoning in models beyond simple forecasting accuracy.
Findings
Existing models show fragmented strengths in temporal reasoning.
High numerical forecasting accuracy does not guarantee contextual understanding.
Models exhibit systematic failure modes under complex temporal conditions.
Abstract
It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical…
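The four-tier design described above can be pictured as a gate on what the model sees at each tier. The following is a minimal sketch, not the benchmark's actual code: the names `Tier`, `Sample`, and `visible_inputs` are hypothetical, and the assumption that each tier adds information on top of the previous one is an interpretation of "progressively richer informational settings", not something the abstract states explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Tier(Enum):
    # Hypothetical names for the four tiers listed in the abstract.
    HISTORICAL_STRUCTURE = 1   # interpret the observed history
    CONTEXT_FREE_FORECAST = 2  # forecast from history alone
    CONTEXTUAL_REASONING = 3   # history plus external context
    EVENT_CONDITIONED = 4      # history, context, and an event description


@dataclass
class Sample:
    history: Sequence[float]
    future: Sequence[float]        # held-out targets, used only for scoring
    context: Optional[str] = None
    event: Optional[str] = None


def visible_inputs(sample: Sample, tier: Tier) -> dict:
    """Return only the fields a model may see at the given tier.

    Assumption: tiers are cumulative. Future targets are never exposed.
    """
    view = {"history": list(sample.history)}
    if tier.value >= Tier.CONTEXTUAL_REASONING.value:
        view["context"] = sample.context
    if tier is Tier.EVENT_CONDITIONED:
        view["event"] = sample.event
    return view
```

Comparing a model's predictions across the views returned for different tiers on the same `Sample` is one way to check whether added context or event information actually changes its behavior.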
Taxonomy
Topics: Machine Learning in Healthcare · Forecasting Techniques and Applications · Time Series Analysis and Forecasting
