TL;DR
Xihe introduces a hierarchical interleaved block attention mechanism for scalable zero-shot time series modeling, enabling effective multi-scale dependency capture and achieving state-of-the-art results across various model sizes.
Contribution
The paper proposes HIBA, a novel attention architecture for time series models, and develops Xihe, a scalable family of models that excel in zero-shot transfer tasks.
Findings
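With only 9.5M parameters, Xihe-tiny outperforms the majority of existing TSFMs on the GIFT-Eval benchmark.
Xihe-max (1.5B parameters) achieves new state-of-the-art zero-shot performance on GIFT-Eval.
HIBA captures multi-scale dependencies in time series data by combining intra-block (local) and inter-block (global) sparse attention.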
Abstract
The rapid advancement of time series foundation models (TSFMs) has been propelled by migrating architectures from language models. While existing TSFMs demonstrate impressive performance, their direct adoption of cross-domain architectures constrains effective capture of the multi-scale temporal dependencies inherent to time series data. This limitation becomes particularly pronounced during zero-shot transfer across datasets with divergent underlying patterns and sampling strategies. To address these challenges, we propose Hierarchical Interleaved Block Attention (HIBA), which employs hierarchical inter- and intra-block sparse attention to effectively capture multi-scale dependencies. Intra-block attention facilitates local information exchange, and inter-block attention operates across blocks to capture global temporal pattern interaction and dynamic evolution. Leveraging the HIBA…
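The abstract describes the two attention patterns only in prose, so the following is a minimal sketch of one plausible reading, not the paper's implementation. It assumes that intra-block attention links positions inside the same contiguous block, and that inter-block attention links positions sharing the same offset across blocks (a strided pattern). The function names, the strided interpretation, and the requirement that the sequence length be a multiple of the block size are all assumptions made for illustration.

```python
def intra_block_mask(n, block):
    """Assumed local pattern: i may attend to j when both lie in the same block."""
    return [[i // block == j // block for j in range(n)] for i in range(n)]


def inter_block_mask(n, block):
    """Assumed global pattern: i may attend to j when both have the same
    offset inside their respective blocks (a strided, cross-block link)."""
    return [[i % block == j % block for j in range(n)] for i in range(n)]


def reachable_in_two_hops(n, block):
    """Check that one intra-block hop followed by one inter-block hop
    connects every pair of positions (assumes n is a multiple of block)."""
    intra = intra_block_mask(n, block)
    inter = inter_block_mask(n, block)
    return all(
        any(intra[i][k] and inter[k][j] for k in range(n))
        for i in range(n)
        for j in range(n)
    )


def render(mask):
    """Draw a mask as text: 'x' marks an allowed attention pair."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(render(intra_block_mask(8, 4)))
    print()
    print(render(inter_block_mask(8, 4)))
    print()
    print("full coverage in two hops:", reachable_in_two_hops(8, 4))
```

Under these assumptions, each position attends to `block` keys in the local layer and `n / block` keys in the global layer, instead of all `n` keys in dense attention, while alternating the two layers still lets information travel between any pair of positions. Varying `block` from layer to layer is one way such a scheme could expose several temporal scales, although the block-size schedule the authors use is not given in this excerpt.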
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper presents results on the GIFT-Eval benchmark, instead of only using the seven small LTF datasets. 2. The writing is very good and easy to follow. However, the authors tend to use ";" too often. For example, in “strong zero-shot capabilities; Although Moirai”, the “;” should be replaced with “.”. Some other grammar errors: - Some baseline methods are misspelled: “Dlinear” → “DLinear”, “PatchTsT” → “PatchTST” - Line 84: "combining public available datasets" → "combining publicly availabl
1. Not enough ablations on the hierarchical structure. 2. Not enough ablations on the K prediction heads, and this part does not seem very novel or scalable. 3. Lack of pretraining details, such as dataset mixing strategies. I feel that the paper in its current form is not ready for acceptance. My main concern is that the current ablation studies are not comprehensive enough to highlight the main contribution of this paper, which is to use a hierarchical structure of interleaved intra-block and inter-b
The paper has the following strengths: 1. The presentation is clear and understandable. 2. The experimental evaluation is performed on a well-established leaderboard (GIFT), which comprises a significant amount of univariate time series evaluation data. 3. Ablation studies are reported.
However, the paper has the following weaknesses: 1. The concept of intra- and inter-block attention is not new. In the TSFM literature, similar concepts were proposed in the TSMixer[1] and TTM[2] papers, which employed mixing instead of attention. How are intra- and inter-block attention conceptually different from intra- and inter-patch mixing? 2. The concept of varying the block length is not novel either. The TTM paper proposed something that its authors called "Adaptive Patching". How
The Xihe family demonstrates state-of-the-art zero-shot performance on the comprehensive GIFT-Eval benchmark. The fact that the smallest 9.5M parameter model outperforms the majority of existing TSFMs is a very strong result, highlighting the architectural efficiency of HIBA. The paper successfully trains and evaluates a family of models ranging from 9.5M to 1.5B parameters. The results show a clear scaling trend where performance on both CRPS and MASE metrics improves monotonically with model
W1: The pre-training dataset is a mix of public datasets and synthetic data. While large, the contribution of the data-quality-aware mixing strategy versus the architectural improvements is not explicitly disentangled. It is unclear how much of the performance gain comes from this curated data mix. W2: The implementation details of the Hierarchical Block Size are ambiguous. Though the authors provide information about the block size in the appendix, it is not clear how to configure the block size for intra-
