TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness

Zhiyuan Zhao; Juntong Ni; Shangqing Xu; Haoxin Liu; Wei Jin; B. Aditya Prakash

arXiv:2506.06482·cs.LG·March 26, 2026

Open Access · 3 Reviews

TL;DR

TimeRecipe introduces a comprehensive benchmarking framework that evaluates individual components of time-series forecasting models, providing insights and recommendations to improve model design and performance across diverse scenarios.

Contribution

TimeRecipe systematically assesses module-level effectiveness in time-series forecasting, revealing design insights and outperforming existing methods in extensive experiments.

Findings

01

Exhaustive exploration improves forecasting accuracy.

02

Specific design choices are linked to different forecasting scenarios.

03

Toolkit recommends suitable architectures based on empirical insights.
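The module-level exploration summarized in these findings can be illustrated with a minimal sketch. It enumerates every combination in a small design space and scores each one with rolling-window MSE on a synthetic series. Everything here is an assumption made for illustration: the `DESIGN_SPACE` entries, the toy `forecast` function, and the `benchmark` helper are hypothetical and are not TimeRecipe's actual modules, models, or API.

```python
from itertools import product
from statistics import mean

# Hypothetical design space for illustration only; these names and options
# are assumptions, not the module list used by TimeRecipe.
DESIGN_SPACE = {
    "normalization": ["none", "last_value"],
    "decomposition": ["none", "moving_average"],
}


def moving_average(values, k=3):
    """Trailing moving average, used here as a crude trend estimate."""
    return [mean(values[max(0, i - k + 1): i + 1]) for i in range(len(values))]


def forecast(window, horizon, normalization, decomposition):
    """Toy forecaster: repeat the last value of the processed window."""
    series = list(window)
    offset = 0.0
    if normalization == "last_value":
        offset = series[-1]  # subtract the last observation before processing
        series = [x - offset for x in series]
    if decomposition == "moving_average":
        series = moving_average(series)  # keep only the smoothed trend
    return [series[-1] + offset] * horizon  # add the offset back


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return mean((p - t) ** 2 for p, t in zip(pred, target))


def benchmark(series, lookback, horizon):
    """Score every module combination with rolling-window MSE."""
    names = list(DESIGN_SPACE)
    results = {}
    for combo in product(*(DESIGN_SPACE[n] for n in names)):
        config = dict(zip(names, combo))
        errors = []
        for start in range(len(series) - lookback - horizon + 1):
            window = series[start:start + lookback]
            target = series[start + lookback:start + lookback + horizon]
            errors.append(mse(forecast(window, horizon, **config), target))
        results[combo] = mean(errors)
    return results


if __name__ == "__main__":
    toy_series = [float(i % 7) for i in range(40)]  # synthetic sawtooth
    scores = benchmark(toy_series, lookback=8, horizon=2)
    for combo, score in sorted(scores.items(), key=lambda kv: kv[1]):
        print(combo, round(score, 4))
```

A real study would replace the toy forecaster with trained models and repeat the loop across datasets and horizons; the sketch only shows the shape of an exhaustive grid over module choices.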

Abstract

Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

The paper conducts an extensive survey of reusable components and then performs a systematic study. Table 2 is an output of this extensive study.

Weaknesses

- Given that the paper was submitted on learning time series and dynamical systems, I feel it is more suitable for the benchmark and dataset track. I have therefore looked at the paper from a benchmarking and experimental perspective. Why are foundation models not part of this work?
- Motivation. If I am a developer, how can I consume the outcome of your study? For example, can you provide a case study on how to get a leaderboard-topping agent on GiftEval? GiftEval is a time series

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is well written and easy to understand.
2. The mapping from measured properties to module choices is an interesting idea and worthy of investigation.
3. The training-free selector is a pragmatic contribution that can reduce exploration cost.

Weaknesses

1. Regularization, optimization, schedulers, and data augmentation are not systematically modularized, though they often rival architecture in impact. The historical-window length is also unclear, despite its significant effect on performance.
2. The choice of datasets and prediction horizons (e.g., 720) has been criticized by researchers as impractical in real-world settings (https://cbergmeir.com/talks/bergmeir2024NeurIPSInvTalk.pdf), which weakens the reliability of the conclusions and sugges

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper introduces a new paradigm for time-series forecasting benchmarking by breaking down models into five core modules and systematically benchmarking their combinations.
2. There are quite a few nice illustrations.
3. This work focuses on an important problem that could have real-world applications.
4. The figures and tables used in this work are clear and easy to read.

Weaknesses

1. While the coverage of LTSF, PEMS, and M4 datasets is excellent, novel datasets introduced (e.g., unemployment forecasting from Time-MMD) are only briefly mentioned and lack rigorous description (see Section 4.2 and Appendix B). For maximal transparency, the properties, preprocessing, and evaluation setup should be as detailed for these new datasets as for the standard ones.
2. While Table 2 and Figure 1 are helpful, many of the empirical summaries require close reading to decipher key findings

Taxonomy

Topics: Forecasting Techniques and Applications · Traffic Prediction and Management Techniques · Machine Learning in Healthcare