TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness
Zhiyuan Zhao, Juntong Ni, Shangqing Xu, Haoxin Liu, Wei Jin, B. Aditya Prakash

TL;DR
TimeRecipe introduces a comprehensive benchmarking framework that evaluates individual components of time-series forecasting models, providing insights and recommendations to improve model design and performance across diverse scenarios.
Contribution
It systematically assesses module-level effectiveness in time-series forecasting, revealing design insights and outperforming existing methods through extensive experiments.
Findings
Exhaustive exploration of module combinations improves forecasting accuracy.
Specific design choices are linked to different forecasting scenarios.
Toolkit recommends suitable architectures based on empirical insights.
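As a rough illustration of what exhaustive module-level exploration could look like, the sketch below enumerates every combination in a small design space and picks the lowest-scoring one. It is not the TimeRecipe code: the module names, their options, and the `toy_score` function are hypothetical placeholders, and a real benchmark would replace `toy_score` with a validation error obtained by training each configuration.

```python
from itertools import product

# Hypothetical design space: module names and options are illustrative only,
# not the modules actually benchmarked in the paper.
DESIGN_SPACE = {
    "normalization": ["none", "instance_norm"],
    "decomposition": ["none", "moving_average"],
    "backbone": ["mlp", "transformer"],
}


def enumerate_configs(space):
    """Return one dict per combination of module options (Cartesian product)."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]


def toy_score(config):
    """Stand-in for a validation error (lower is better); values are made up."""
    penalty = {
        "none": 1.0,
        "instance_norm": 0.2,
        "moving_average": 0.5,
        "mlp": 0.4,
        "transformer": 0.6,
    }
    return sum(penalty[option] for option in config.values())


def best_config(configs, score_fn):
    """Pick the configuration with the lowest score."""
    return min(configs, key=score_fn)


configs = enumerate_configs(DESIGN_SPACE)
best = best_config(configs, toy_score)
```

The number of configurations grows multiplicatively with each added module, which is why a recommendation step that narrows the search can reduce exploration cost.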
Abstract
Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the…
Peer Reviews
Decision·ICLR 2026 Poster
The paper conducts an extensive survey of reusable components and then performs a systematic study. Table 2 is the output of this extensive study.
- Given a paper submitted on learning time series and dynamical systems, I feel the paper is more suitable for the benchmark and dataset track. Thus, I have started looking at the paper from a benchmarking and experimental task perspective. Why are the foundation models not part of this work? - Motivation. If I am a developer, how can I consume the outcome of your study? For example, can you provide a case study on how to get a leaderboard-topping agent on GiftEval? GiftEval is a time series
1. The paper is well written and easy to understand. 2. The mapping from measured properties to module choices is an interesting idea and worthy of investigation. 3. The training-free selector is a pragmatic contribution that can reduce exploration cost.
1. Regularization, optimization, schedulers, and data augmentation are not systematically modularized, though they often rival architecture in impact. The historical-window length is also unclear, despite its significant effect on performance. 2. The choice of datasets and prediction horizons (e.g., 720) has been criticized by researchers as impractical in real-world settings (https://cbergmeir.com/talks/bergmeir2024NeurIPSInvTalk.pdf), which weakens the reliability of the conclusions and sugges
1. The paper introduces a new paradigm for time-series forecasting benchmarking by breaking down models into five core modules and systematically benchmarking their combinations. 2. There are quite a few nice illustrations. 3. This work focuses on an important problem that could have real-world applications. 4. The figures and tables used in this work are clear and easy to read.
1. While the coverage of LTSF, PEMS, and M4 datasets is excellent, novel datasets introduced (e.g., unemployment forecasting from Time-MMD) are only briefly mentioned and lack rigorous description (see Section 4.2 and Appendix B). For maximal transparency, the properties, preprocessing, and evaluation setup should be as detailed for these new datasets as for the standard ones. 2. While Table 2 and Figure 1 are helpful, many of the empirical summaries require close reading to decipher key findings
Taxonomy
Topics: Forecasting Techniques and Applications · Traffic Prediction and Management Techniques · Machine Learning in Healthcare
