TempusBench: An Evaluation Framework for Time-Series Forecasting

Denizalp Goktas; Gerardo Riaño-Briceño; Alif Abdullah; Aryan Nair; Chenkai Shen; Beatriz de Lucio; Alexandra Magnusson; Farhan Mashrur; Ahmed Abdulla; Shawrna Sen; Mahitha Thippireddy; Gregory Schwartz; Amy Greenwald

arXiv:2604.11529 · cs.LG · April 17, 2026

TL;DR

TempusBench is a comprehensive, open-source evaluation framework for time-series foundation models, addressing current limitations by providing new datasets, benchmark tasks, standardized evaluation, and visualization tools.

Contribution

TempusBench introduces an evaluation framework with new datasets, new benchmark tasks, and a standardized pipeline, aiming to make the assessment of time-series foundation models (TSFMs) fairer and more interpretable.

Findings

01 Provides new datasets not used in pretraining

02 Includes benchmark tasks beyond traditional metrics

03 Offers a standardized hyperparameter tuning protocol

Abstract

Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, existing evaluation frameworks comprise benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, these frameworks evaluate models along a narrowly defined set of benchmark forecasting tasks, such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third,…

Code & Models

Repositories

Smlcrm/TempusBench
github

Datasets

Smlcrm/tempus_bench
dataset · 32 dl