The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models
Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen

TL;DR
This paper uncovers a 'few-layer dominance' phenomenon in large-scale time series models: only a small subset of layers is crucial, and retaining just those layers improves both efficiency and accuracy.
Contribution
It identifies the 'few-layer dominance' phenomenon in time series (TS) models and proposes a method that retains only the important layers, improving both accuracy and inference speed.
Findings
Retaining 21% of layers improves accuracy by up to 12%.
Speedup of 2.7x achieved by layer pruning.
Method effective across 8 SOTA models and 95% of tasks.
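The findings above amount to keeping a small fraction of layers ranked by some importance score. The following is a minimal sketch of that selection step only, not the authors' implementation: the function name `select_layers`, the rounding rule, and the example scores are all hypothetical, and how the paper actually computes its scores is not described on this page.

```python
import math


def select_layers(scores, keep_ratio):
    """Return indices of the highest-scoring layers, in original depth order.

    `scores` is one importance value per layer (hypothetical values here);
    `keep_ratio` is the fraction of layers to retain, e.g. 0.21.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Assumption: round up so that at least one layer is always kept.
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore depth order so the retained layers can be stacked as before.
    return sorted(ranked[:n_keep])


# Made-up scores for a 24-layer model: a few early and late layers dominate.
scores = [0.9, 0.8, 0.1, 0.1] + [0.05] * 16 + [0.3, 0.4, 0.7, 0.85]
kept = select_layers(scores, keep_ratio=0.21)
print(kept)
```

With these made-up scores, a 21% budget on 24 layers rounds up to 6 retained layers. A real pipeline would still need to realign or fine-tune the pruned model afterwards, which this sketch does not cover.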
Abstract
Large-scale models are at the forefront of time series (TS) forecasting, dominated by two paradigms: fine-tuning text-based Large Language Models (LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both approaches share a foundational assumption that scaling up model capacity and data volume leads to improved performance. However, we observe a scaling paradox in TS models, revealing a puzzling phenomenon that larger models do NOT achieve better performance. Through extensive experiments on two model families across four scales (100M to 1.7B parameters) and diverse data (up to 6B observations), we rigorously confirm that the scaling paradox is a pervasive issue. We then diagnose its root cause by analyzing internal representations, identifying a phenomenon we call few-layer dominance: only a small subset of layers are functionally…
Peer Reviews
Decision·Submitted to ICLR 2026
S1: This paper points out a problem in existing large-scale time series models: under both the TSFM and LLM4TS paradigms, a larger model does not mean better results.
S2: The paper explores multiple factors that influence model performance through three research questions: network scaling, data volume, and data homogeneity.
S3: The paper proposes a hypothesis that not all layers contribute significantly to the model's prediction performance; only a small number of layers dominate the final predi
W1: In research question 1 (Section 4.1), the article argues that scaling the backbone network does not improve performance. However, the author may have overlooked the issue of dataset stationarity. In Figure 2, the datasets where performance degrades as the model size increases (e.g., ETT, Exchange) are mostly those with poor stationarity or few data points. In this situation, increasing the network size may cause the model to overfit, reducing performance. In contrast, for datasets with high
- Clear empirical diagnosis of layer redundancy with simple, model-agnostic tooling (importance scoring + prune-and-realign).
- Broad benchmarking across architectures and sizes, showing consistent efficiency gains under certain settings.
- The proposed pipeline is practical to adopt and may serve as a useful diagnostic baseline for TS models.
- Numerous formatting issues (e.g., Line 221 overlapping lines; Line 463 “thedegree” missing a space); the paper needs thorough proofreading.
- The claim that “not all layers are equally important” has already been demonstrated in many tasks (e.g., [1][2]); moreover, time-series forecasting typically does not require large world-knowledge memory, so the contribution is limited in novelty.
- The technical contribution is relatively limited: the importance score is a heuristic composition with sev
S1 The paper introduces a concept -- few-layer dominance -- revealing that only a small subset of layers in large Transformer-based time series models actively contribute to learning. This reframes the scaling problem as layer laziness, providing an empirical view of why deeper architectures yield diminishing returns in time series modeling.
S2 The paper uses an analytical framework that jointly measures inter-layer representation shifts and intra-layer attention diversity to quantify each laye
W1: A large body of recent work has already investigated structural redundancy and layer importance in large Transformers. This paper primarily transfers those analytical methods to time-series models but does not provide domain-specific insights. Without a deeper connection to time-series dynamics (e.g., temporal correlation, seasonality, or frequency structure), the contribution is more like a direct application of existing LLM findings. [1] Men, Xin, et al. "Shortgpt: Layers in large languag
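The reviews describe layer importance in terms of inter-layer representation shift and intra-layer attention diversity. A rough sketch of what two such measurements could look like is given below. Both definitions are assumptions chosen for illustration (cosine distance between a layer's input and output hidden states, and mean pairwise cosine distance between per-head attention maps); the paper's actual score and how it combines the two terms are not given on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representation_shift(h_in, h_out):
    """Mean (1 - cosine similarity) between a layer's input and output token vectors.

    Assumed proxy: values near 0 mean the layer barely changes its input.
    """
    return sum(1.0 - cosine(x, y) for x, y in zip(h_in, h_out)) / len(h_in)


def attention_diversity(head_maps):
    """Mean pairwise (1 - cosine similarity) between flattened per-head attention maps.

    Assumed proxy: values near 0 mean all heads attend in the same pattern.
    """
    flat = [[w for row in m for w in row] for m in head_maps]
    pairs = [(i, j) for i in range(len(flat)) for j in range(i + 1, len(flat))]
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(flat[i], flat[j]) for i, j in pairs) / len(pairs)


# Toy hidden states for two tokens (made-up numbers).
h_in = [[1.0, 0.0], [0.0, 1.0]]
lazy_out = [[2.0, 0.0], [0.0, 3.0]]    # same directions, only rescaled
active_out = [[0.0, 1.0], [1.0, 0.0]]  # directions rotated by 90 degrees

# Toy attention maps for two heads over two tokens (made-up numbers).
same_heads = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
diff_heads = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]

lazy_shift = representation_shift(h_in, lazy_out)
active_shift = representation_shift(h_in, active_out)
low_div = attention_diversity(same_heads)
high_div = attention_diversity(diff_heads)
```

In this toy setup the rescaled output yields a shift of 0 and the rotated output a shift of 1, while identical heads yield a diversity of 0 and complementary heads a diversity of 1. In practice the hidden states and attention maps would come from forward passes over real time series batches.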
Taxonomy
Topics: Forecasting Techniques and Applications · Stock Market Forecasting Methods · Time Series Analysis and Forecasting
