TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems

Md Atik Ahamed; Mihir Parmar; Palash Goyal; Yiwen Song; Long T. Le; Qiang Cheng; Chun-Liang Li; Hamid Palangi; Jinsung Yoon; Tomas Pfister

arXiv:2604.05364·cs.AI·April 8, 2026

TL;DR

TFRBench is a novel benchmark for evaluating the reasoning abilities of forecasting systems, emphasizing interpretability and causal analysis over traditional accuracy metrics.

Contribution

The paper introduces a multi-agent framework and a reasoning protocol for assessing how well forecasting systems analyze cross-channel dependencies, trends, and external events.

Findings

01

Prompting LLMs with the generated reasoning traces improves forecasting accuracy over direct numerical prediction (e.g., 40.2% to 56.6%).

02

Off-the-shelf LLMs struggle with reasoning and domain-specific dynamics.

03

The benchmark promotes interpretable, reasoning-focused evaluation of time-series forecasting.

Abstract

We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems, specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that uses an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective and useful for evaluation, and that prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg.…
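
The abstract mentions an iterative verification loop that keeps reasoning traces numerically grounded. This page does not describe the paper's actual agents, prompts, or acceptance criteria, so the following is only a minimal sketch of what such a loop could look like. Every name in it (`Trace`, `synthesize_trace`, `toy_generate`, the `tolerance` threshold, and the use of mean absolute error as the check) is a hypothetical choice made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Trace:
    """A reasoning text paired with the forecast it argues for (hypothetical type)."""
    reasoning: str
    forecast: List[float]


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    if len(pred) != len(target):
        raise ValueError("forecast and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def synthesize_trace(
    generate: Callable[[Sequence[float], Optional[str]], Trace],
    history: Sequence[float],
    target: Sequence[float],
    tolerance: float,
    max_rounds: int = 3,
) -> Tuple[Trace, float]:
    """Generate a trace, check it against the known target, and retry with feedback.

    Assumption: a trace counts as verified once the MAE of its forecast is
    within `tolerance`. The best trace seen so far is returned either way.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    best: Optional[Trace] = None
    best_err = float("inf")
    for _ in range(max_rounds):
        trace = generate(history, feedback)
        err = mean_absolute_error(trace.forecast, target)
        if err < best_err:
            best, best_err = trace, err
        if err <= tolerance:
            break  # verification passed
        feedback = f"forecast MAE {err:.3f} exceeds tolerance {tolerance:.3f}; revise the reasoning"
    assert best is not None
    return best, best_err


def toy_generate(history: Sequence[float], feedback: Optional[str]) -> Trace:
    """Stand-in for an LLM agent: naive on the first call, trend-aware after feedback."""
    last = history[-1]
    if feedback is None:
        return Trace("repeat the last observed value", [last, last])
    step = history[-1] - history[-2]
    return Trace("extend the most recent difference", [last + step, last + 2 * step])


trace, err = synthesize_trace(toy_generate, [1.0, 2.0, 3.0], [4.0, 5.0], tolerance=0.1)
print(trace.reasoning, err)
```

In this toy run the first trace fails the check, the feedback string triggers a revised trace, and the loop stops once the forecast matches the target. In a real pipeline the `generate` callable would wrap a model call, and the verification step would likely be richer than a single error threshold.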

Code & Models

Repositories

https://tfrbench.github.io

Datasets

AtikAhamed/TFRBench
dataset · 40 downloads
