TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
Md Atik Ahamed, Mihir Parmar, Palash Goyal, Yiwen Song, Long T. Le, Qiang Cheng, Chun-Liang Li, Hamid Palangi, Jinsung Yoon, Tomas Pfister

TL;DR
TFRBench is a novel benchmark for evaluating the reasoning abilities of forecasting systems, emphasizing interpretability and causal analysis over traditional accuracy metrics.
Contribution
It introduces a multi-agent framework and reasoning protocol for assessing forecasting models' understanding of dependencies, trends, and external events.
Findings
Prompting LLMs with the generated reasoning traces improves forecasting accuracy over direct numerical prediction (e.g., 40.2% to 56.6%).
Off-the-shelf LLMs struggle with reasoning and domain-specific dynamics.
The benchmark promotes interpretable, reasoning-focused evaluation in time-series forecasting.
Abstract
We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems, specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that uses an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective and useful for evaluation, and that prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg.…
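The abstract mentions an iterative verification loop that yields numerically grounded reasoning traces but does not specify the agents or the checks involved. The following is only a minimal sketch of the general generate-verify-retry pattern, under stated assumptions: the names `Trace`, `fit_slope`, `verify`, `refine_until_grounded`, and `toy_generate` are hypothetical, the "numeric grounding" check is reduced to comparing a claimed trend slope against a least-squares fit, and the generator is a stand-in function rather than an LLM agent. It is not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trace:
    """Hypothetical reasoning trace: free text plus one checkable numeric claim."""
    text: str
    claimed_slope: float


def fit_slope(series: List[float]) -> float:
    """Least-squares slope of the series against its index (needs >= 2 points)."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def verify(trace: Trace, series: List[float], tol: float = 0.1) -> Optional[str]:
    """Return None if the claim matches the data, else a feedback message."""
    actual = fit_slope(series)
    if abs(trace.claimed_slope - actual) <= tol:
        return None
    return f"claimed slope {trace.claimed_slope:.2f}, data shows {actual:.2f}"


def refine_until_grounded(
    generate: Callable[[List[float], Optional[str]], Trace],
    series: List[float],
    max_rounds: int = 3,
) -> Optional[Trace]:
    """Ask the generator for a trace, verify it, and feed back any mismatch."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        trace = generate(series, feedback)
        feedback = verify(trace, series)
        if feedback is None:
            return trace  # claim is consistent with the data
    return None  # no grounded trace within the round budget


def toy_generate(series: List[float], feedback: Optional[str]) -> Trace:
    """Stand-in generator: wrong on the first try, corrected once given feedback."""
    if feedback is None:
        return Trace("The series looks flat.", 0.0)
    return Trace("The series rises steadily.", fit_slope(series))


result = refine_until_grounded(toy_generate, [1.0, 3.0, 5.0, 7.0])
```

In this toy run the first trace fails verification, the feedback is passed back to the generator, and the second trace is accepted. A real system would presumably replace `toy_generate` with a model call and `verify` with richer checks, which the text here does not describe.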
