Towards Neural Scaling Laws for Time Series Foundation Models
Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, Shirui Pan

TL;DR
This paper investigates the scaling laws of time series foundation models, comparing encoder-only and decoder-only Transformers across in-distribution and out-of-distribution data, revealing architecture impacts on scalability.
Contribution
It provides the first comprehensive analysis of TSFM scaling laws for both ID and OOD data, highlighting architecture effects and offering practical scaling guidelines.
Findings
Log-likelihood loss scales similarly in ID and OOD settings.
Encoder-only Transformers scale better than decoder-only Transformers.
Architectural improvements enhance ID performance but may reduce OOD scalability.
Abstract
Scaling laws offer valuable insights into the design of time series foundation models (TSFMs). However, previous research has largely focused on the scaling laws of TSFMs for in-distribution (ID) data, leaving their out-of-distribution (OOD) scaling behavior and the influence of model architectures less explored. In this work, we examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data. These models are trained and evaluated across varying parameter counts, compute budgets, and dataset sizes. Our experiments reveal that the log-likelihood loss of TSFMs exhibits similar scaling behavior in both OOD and ID settings. We further compare the scaling properties across different architectures, incorporating two state-of-the-art TSFMs as case studies, showing that model architecture plays a significant role…
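As a rough illustration of what fitting a scaling law involves, the sketch below fits a power law of the form loss ≈ a · N^(-b) by linear regression in log-log space. This is a commonly used functional form, not necessarily the one used in the paper, whose exact formulation is not given in this summary. The function name `fit_power_law` and all numbers are hypothetical, the data are synthetic, and the fit assumes strictly positive loss values.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~= a * x**(-b) by least squares in log-log space.

    Assumes x and loss are strictly positive (an assumption of this
    sketch; a log-likelihood loss may need shifting before such a fit).
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    a = float(np.exp(intercept))  # prefactor
    b = float(-slope)             # scaling exponent
    return a, b


# Synthetic values for illustration only (not results from the paper).
params = np.array([1e6, 1e7, 1e8, 1e9])
synthetic_loss = 5.0 * params ** -0.1

a, b = fit_power_law(params, synthetic_loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Running the same fit separately on ID and OOD losses, or per architecture, would give exponents that can be compared directly, which is the kind of comparison the abstract describes.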
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper flows very well and is easy to follow. 2. While most existing TSFMs for forecasting choose a decoder-only structure, this paper presents new findings on encoder-only versus decoder-only architectures that contradict prior findings, and analyzes them from the perspective of OOD data.
1. Evaluation setup could be improved. On line 130, "This subset includes test data from the ETTh1-2, ETTm1-2, electricity, and weather datasets." Are these the only datasets that are used as OOD test data throughout all the experiments? To strengthen your findings, I suggest varying the OOD datasets (maybe include a confidence interval), and see if your findings still hold. 2. The paper could benefit from explaining why its findings are different from those of the existing works.
1. The study extends the existing research on scaling laws for TSFMs in a novel direction by investigating their behavior across different data distributions (ID and OOD) and model architectures. While previous works have primarily focused on ID scenarios, this paper breaks new ground by systematically examining the scaling properties of TSFMs in OOD contexts. Furthermore, the comparative analysis of encoder-only and decoder-only Transformers, as well as the inclusion of state-of-the-art TSFMs (
1. Limited scope of model architectures: The study focuses on two main model architectures: encoder-only and decoder-only Transformers. While these are indeed widely used in TSFMs, the paper could benefit from including a broader range of architectures, such as encoder-decoder models (e.g., Seq2Seq) or hybrid models that combine Transformers with other neural network components (e.g., CNN, RNN). Expanding the scope of the investigated architectures would provide a more comprehensive understandin
- Originality: The paper's focus on the scaling laws of TSFMs on OOD data represents a novel and timely contribution. - Quality: The research is well-executed, with a comprehensive methodology that involves training and evaluating a wide range of models across various parameter counts, compute budgets, and dataset sizes. The use of a large and diverse dataset further strengthens the quality of the empirical analysis. - Clarity: The paper is well-written and organized. - Significance: Author's fi
- Line 046: provide citations for neural scaling laws that provide ground for believing that they exist. - Table 1: review the "proportion" row against the "Time points" row. Sales is 0.96% of dataset with 140M points, whereas Web is 0.40% of dataset with 600M points. - Line 149: "Given its proven effectiveness in improving time series forecasting performance (Woo et al., 2023), we adopt RoPE as a replacement for the original Transformer’s positional encoding." ==> my reading of Woo (2023) is th
Taxonomy
Topics: Neural Networks and Applications
