Timer-XL: Long-Context Transformers for Unified Time Series Forecasting
Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, Mingsheng Long

TL;DR
Timer-XL introduces a unified causal Transformer model for diverse time series forecasting tasks, leveraging a novel TimeAttention mechanism to capture complex dependencies and achieve state-of-the-art results across benchmarks.
Contribution
The paper proposes Timer-XL, a universal long-context Transformer architecture with a new TimeAttention mechanism for improved multivariate time series forecasting.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Effective in zero-shot forecasting scenarios.
Handles non-stationary and multivariate time series with complex dynamics.
Abstract
We present Timer-XL, a causal Transformer for unified time series forecasting. To uniformly predict multidimensional time series, we generalize next token prediction, predominantly adopted for 1D token sequences, to multivariate next token prediction. The paradigm formulates various forecasting tasks as a long-context prediction problem. We opt for decoder-only Transformers that capture causal dependencies from varying-length contexts for unified forecasting, making predictions on non-stationary univariate time series, multivariate series with complicated dynamics and correlations, as well as covariate-informed contexts that include exogenous variables. Technically, we propose a universal TimeAttention to capture fine-grained intra- and inter-series dependencies of flattened time series tokens (patches), which is further enhanced by deft position embedding for temporal causality and…
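As an illustration of the flattened-token idea in the abstract, here is a minimal NumPy sketch, not the authors' implementation. It assumes non-overlapping patches, variable-major flattening, and a mask in which every series may attend to every other series at the same or an earlier patch position. The function names (`patchify`, `flatten_tokens`, `time_causal_mask`, `masked_attention`) are hypothetical, and the attention omits learned projections and position embeddings.

```python
import numpy as np


def patchify(series, patch_len):
    """Split an (N, T) array into non-overlapping patches of shape (N, T // patch_len, patch_len)."""
    n_vars, length = series.shape
    n_patches = length // patch_len
    # Drop any trailing samples that do not fill a whole patch.
    return series[:, : n_patches * patch_len].reshape(n_vars, n_patches, patch_len)


def flatten_tokens(patches):
    """Flatten (N, P, L) patches into (N * P, L) tokens, one series after another."""
    n_vars, n_patches, patch_len = patches.shape
    return patches.reshape(n_vars * n_patches, patch_len)


def time_causal_mask(n_vars, n_patches):
    """Boolean (N*P, N*P) mask: a token at patch position t may attend to any
    token, from any series, at patch position s <= t (assumed dependency pattern)."""
    time_index = np.tile(np.arange(n_patches), n_vars)
    return time_index[:, None] >= time_index[None, :]


def masked_attention(tokens, mask):
    """Single-head dot-product attention using the raw tokens as queries, keys, and values."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, weights


# Two series of length 8, patch length 2 -> 2 * 4 = 8 flattened tokens.
series = np.arange(16, dtype=float).reshape(2, 8)
tokens = flatten_tokens(patchify(series, patch_len=2))
mask = time_causal_mask(n_vars=2, n_patches=4)
output, weights = masked_attention(tokens, mask)
```

Because the mask depends only on each token's patch position, reordering the series in the flattened sequence permutes the mask's rows and columns but does not change which tokens can see which.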
Peer Reviews
Decision: ICLR 2025 Poster
a) The paper studies an interesting problem: long-context modeling for time series forecasting. The authors attempt to connect long context to scenarios beyond univariate modeling through next-token style modeling of multivariate and covariate-informed time series. b) Experiments have been conducted on many diverse settings, although the experiments themselves have some limitations.
a) While the problem of long context modeling is interesting, the primary weakness of this work is the lack of clarity about the goal and a proper scope. The discussion is confusing and often only loosely relates to the long context setting which appears to be the primary goal. Authors claim that "existing transformers in the time series field crucially encounter the context bottleneck" which is not as critical of a problem as being portrayed here. Such claims require serious empirical justifica
1. The paper proposes multivariate next token prediction for time series forecasting. This paradigm unifies univariate, multivariate, and covariate-informed forecasting, by treating them as a long-context generation problem. 2. This paper introduces TimeAttention, a novel self-attention mechanism for time series data. TimeAttention captures fine-grained intra and inter-series dependencies, preserves causality in forecasting, and incorporates position embeddings. 3. Experiments show that extendin
1. This paper focuses heavily on comparing Timer-XL with other Transformer models, particularly PatchTST and Timer, and lacks a broader comparison with non-Transformer time series forecasting models. Also, how does Timer-XL compare with recent LLM-based models, e.g., [1], [2]? 2. This paper does not extensively discuss the computational cost of Timer-XL. Though it provides a theoretical derivation, a more detailed analysis of the computational resources required, especially when handling h
1. The manuscript demonstrates a high level of completeness. 2. It provides an effective large-scale framework for advancing large models in time series analysis. 3. The empirical evaluation is comprehensive and promising results are shown.
1. My biggest question lies with Equations 4, 5, 7, and 8. Since Timer-XL is a unified time series framework, these equations include processing steps that sort each time series. For example, with 𝑁 time series, does the order after flattening the sequences significantly affect the causal relationship in Equations 4 and 5? The same question applies to Equations 7 and 8. 2. As a unified time series framework, the author needs to compare it with Moirai in a zero-shot scenario. In Figure 5 of the m
Taxonomy
Topics: Time Series Analysis and Forecasting · Stock Market Forecasting Methods
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
