TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

Muyan Weng; Defu Cao; Wei Yang; Yashaswi Sharma; Yan Liu

arXiv:2602.13272·cs.AI·February 17, 2026

TL;DR

TemporalBench is a benchmark for evaluating the temporal reasoning of models across four real-world domains (retail, healthcare, energy, and physical systems). Its results indicate that high forecasting accuracy does not necessarily imply robust contextual or event-aware understanding.

Contribution

We introduce TemporalBench, a multi-domain benchmark with a four-tier taxonomy to diagnose and analyze temporal reasoning in models beyond simple forecasting accuracy.

Findings

01. Existing models show fragmented strengths in temporal reasoning.

02. High numerical forecasting accuracy does not guarantee contextual understanding.

03. Models exhibit systematic failure modes under complex temporal conditions.

Abstract

It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical…
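
The idea of "controlling access to future targets and contextual information" across the four tiers can be illustrated with a small sketch. This is not the benchmark's actual data format or API: the `Tier`, `Sample`, and `build_inputs` names, the field names, and the assumption that the event-conditioned tier also exposes context are all hypothetical, chosen only to show how progressively richer inputs might be gated while the future targets stay hidden.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Tier(Enum):
    # Tier names paraphrase the abstract; the numbering is an assumption.
    HISTORICAL_STRUCTURE = 1
    CONTEXT_FREE_FORECASTING = 2
    CONTEXTUAL_REASONING = 3
    EVENT_CONDITIONED = 4


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout, not the dataset's real schema.
    history: Sequence[float]
    future: Sequence[float]  # ground truth, never passed to the model
    context: Optional[str] = None
    event: Optional[str] = None


def build_inputs(sample: Sample, tier: Tier) -> dict:
    """Return only the fields a model is allowed to see at the given tier."""
    visible = {"history": list(sample.history)}
    # Assumption: context becomes visible from the contextual tier onward.
    if tier.value >= Tier.CONTEXTUAL_REASONING.value:
        visible["context"] = sample.context
    # Assumption: event descriptions are visible only at the final tier.
    if tier is Tier.EVENT_CONDITIONED:
        visible["event"] = sample.event
    return visible


# Made-up illustrative values, not taken from the benchmark.
sample = Sample(
    history=[10.0, 12.0, 11.0],
    future=[13.0],
    context="holiday week",
    event="store closure on day 4",
)
for tier in Tier:
    print(tier.name, sorted(build_inputs(sample, tier)))
```

Under these assumptions, the first two tiers see only the history, the third adds context, and the fourth adds the event description. Comparing a model's predictions across tiers on the same sample is one way to check whether added information actually changes its behavior.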

Code & Models

Datasets

Melady/TemporalBench
dataset · 7 dl

Taxonomy

Topics: Machine Learning in Healthcare · Forecasting Techniques and Applications · Time Series Analysis and Forecasting