Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention

Yinbo Sun; Yuchen Fang; Zhibo Zhu; Jia Li; Yu Liu; Qiwen Deng; Jun Zhou; Hang Yu; Xingyu Lu; Lintao Ma

arXiv:2510.21795·cs.CV·October 28, 2025

3 Reviews

TL;DR

Xihe introduces a hierarchical interleaved block attention mechanism for scalable zero-shot time series modeling, enabling effective multi-scale dependency capture and achieving state-of-the-art results across various model sizes.

Contribution

The paper proposes HIBA, a novel attention architecture for time series models, and develops Xihe, a scalable family of models that excel in zero-shot transfer tasks.

Findings

01 Xihe-tiny outperforms many existing models with only 9.5M parameters.

02 Xihe-max (1.5B) achieves new state-of-the-art zero-shot performance.

03 HIBA effectively captures multi-scale dependencies in time series data.

Abstract

The rapid advancement of time series foundation models (TSFMs) has been propelled by migrating architectures from language models. While existing TSFMs demonstrate impressive performance, their direct adoption of cross-domain architectures constrains effective capture of the multi-scale temporal dependencies inherent to time series data. This limitation becomes particularly pronounced during zero-shot transfer across datasets with divergent underlying patterns and sampling strategies. To address these challenges, we propose Hierarchical Interleaved Block Attention (HIBA), which employs hierarchical inter- and intra-block sparse attention to effectively capture multi-scale dependencies. Intra-block attention facilitates local information exchange, and inter-block attention operates across blocks to capture global temporal pattern interaction and dynamic evolution. Leveraging the HIBA…
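The split between intra-block and inter-block attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how inter-block attention selects positions, so the strided pattern below (positions sharing the same offset within their blocks attend to each other) is an assumption, as are the function names `intra_block_mask`, `inter_block_mask`, `masked_attention`, and `hiba_like_layer`. The attention itself uses no learned projections, only the raw inputs as queries, keys, and values.

```python
import math

def intra_block_mask(seq_len, block_size):
    """mask[i][j] is True when positions i and j fall in the same block."""
    return [[i // block_size == j // block_size for j in range(seq_len)]
            for i in range(seq_len)]

def inter_block_mask(seq_len, block_size):
    """mask[i][j] is True when i and j share the same offset inside their blocks.

    ASSUMPTION: this strided pattern is one illustrative reading of
    'inter-block' attention, not the paper's confirmed definition.
    """
    return [[i % block_size == j % block_size for j in range(seq_len)]
            for i in range(seq_len)]

def masked_attention(x, mask):
    """Single-head dot-product attention with queries = keys = values = x.

    x is a list of feature vectors; mask[i][j] says whether i may attend to j.
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Scaled dot-product scores, None where attention is masked out.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  if mask[i][j] else None
                  for j, k in enumerate(x)]
        # Numerically stable softmax over the allowed positions only.
        top = max(s for s in scores if s is not None)
        weights = [math.exp(s - top) if s is not None else 0.0 for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) / total
                    for c in range(d)])
    return out

def hiba_like_layer(x, block_size):
    """Local exchange inside blocks, followed by exchange across blocks."""
    n = len(x)
    local = masked_attention(x, intra_block_mask(n, block_size))
    return masked_attention(local, inter_block_mask(n, block_size))
```

With a sequence of 8 positions and a block size of 4, each position attends to 4 positions in the intra-block step and 2 in the inter-block step, instead of all 8 under dense attention. A hierarchical variant could call `hiba_like_layer` with a different `block_size` per layer, though the block sizes the paper actually uses are not given in this summary.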

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper presents the results on the GIFT-Eval benchmark, instead of only using the seven small LTF datasets.
2. The writing is very good and easy to follow. However, the authors tend to use ";" too often. For example, “strong zero-shot capabilities; Although Moirai”, you should replace “;” with “.”. Some other grammar errors:
- Some baseline methods are misspelled: “Dlinear” → “DLinear”, “PatchTsT” → “PatchTST”
- Line 84: "combining public available datasets" → "combining publicly availabl

Weaknesses

1. Not enough ablations on the hierarchical structure.
2. Not enough ablations on the K prediction heads, and this part does not seem very novel/scalable.
3. Lack of pretraining details, such as dataset mixing strategies.

I feel like the paper in the current form is not ready for acceptance. My main concern is the current ablation studies are not comprehensive enough to highlight the main contribution of this paper, which is to use a hierarchical structure of interleaved intra-block and inter-b

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The paper has the following strengths:
1. The presentation is clear and understandable.
2. The experimental evaluation is performed on a well-established leaderboard (GIFT), which comprises a significant amount of univariate time series evaluation data.
3. Ablation studies are reported.

Weaknesses

However, the paper has the following weaknesses:
1. The concept of intra- and inter-block attention is not new. In the TSFM literature, similar concepts were proposed in the TSMixer [1] and TTM [2] papers, which employed mixing instead of attention. How are intra- and inter-block attention conceptually different from intra- and inter-patch mixing?
2. The concept of varying the block length is not novel either. The TTM paper proposed something its authors called "Adaptive Patching". How

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The Xihe family demonstrates state-of-the-art zero-shot performance on the comprehensive GIFT-Eval benchmark. The fact that the smallest 9.5M parameter model outperforms the majority of existing TSFMs is a very strong result, highlighting the architectural efficiency of HIBA. The paper successfully trains and evaluates a family of models ranging from 9.5M to 1.5B parameters. The results show a clear scaling trend where performance on both CRPS and MASE metrics improves monotonically with model

Weaknesses

W1: The pre-training dataset is a mix of public datasets and synthetic data. While large, the contribution of the data-quality-aware mixing strategy versus the architectural improvements is not explicitly disentangled. It's unclear how much of the performance gain comes from this curated data mix.
W2: The implementation details of the hierarchical block size are ambiguous. Though the authors provide information about the block size in the appendix, it is not clear how to configure the block size for intra-
