VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones

Lefei Shen; Mouxiang Chen; Xu Liu; Han Fu; Xiaoxue Ren; Jianling Sun; Zhuo Li; Chenghao Liu

arXiv:2508.04379·cs.CV·October 13, 2025

Open Access · 1 Model · 3 Reviews

TL;DR

VisionTS++ leverages continual pre-training of vision models on time series data, introducing novel encoding and forecasting techniques to achieve state-of-the-art results across diverse datasets.

Contribution

The paper presents a new cross-modal time series foundation model that bridges the modality, variate, and probabilistic gaps through new pre-training and encoding strategies.

Findings

01 · Outperforms existing TSFMs by 6-44% in MSE reduction

02 · Achieves first place on the GIFT-Eval benchmark

03 · Effective in both in-distribution and out-of-distribution forecasting

Abstract

Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1)…
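To make the phrase "reformulating time series forecasting as image reconstruction" concrete, here is a minimal sketch of one way a univariate series could be laid out as a 2-D grayscale array. It assumes a known `period` and folds the series so that each column holds one period; the function names, the standardisation step, and the folding scheme are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def series_to_image(x, period):
    """Fold a 1-D series into a (period, n_cols) array, one period per column.

    Illustrative assumption: the series is standardised first so that the
    resulting "pixel" values live on a comparable scale.
    """
    x = np.asarray(x, dtype=float)
    n_cols = len(x) // period
    if n_cols == 0:
        raise ValueError("series is shorter than one period")
    x = x[-n_cols * period:]              # keep the most recent full periods
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / (sigma + 1e-8)         # standardise
    img = z.reshape(n_cols, period).T     # column j holds the j-th period
    return img, (mu, sigma)

def image_to_series(img, stats):
    """Invert series_to_image: unfold the columns and undo the scaling."""
    mu, sigma = stats
    return img.T.reshape(-1) * (sigma + 1e-8) + mu
```

Under this layout, forecasting could be framed as masking the rightmost columns of the array and asking an image-reconstruction model to fill them in; the reconstructed columns are then unfolded back into a series with `image_to_series`.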

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

(1) first paper that systematically closes the data-range, multivariate and probabilistic gaps when turning an off-the-shelf vision backbone into a competitive TSFM. (2) no new attention layers, no patch re-design; only lightweight heads and input/output converters—easy to reproduce. (3) SOTA on 4 widely used benchmarks (31/62 first places on LTSF, best nMAE on Monash, top CRPS on PF, 1st on GIFT-Eval) with both base and large variants. (4) removing filtering (−7 %), colourisation (−12 %) or mul

Weaknesses

(1) the core idea (TS ➔ image ➔ MAE) is identical; improvements come from three engineering accessories rather than a new modelling principle. (2) no analysis of why pixel-range filtering or random RGB boundaries should be optimal; no guarantee that vision inductive biases align with temporal dynamics. (3) only forecasting; classification, anomaly detection or irregular sampling not tested. (4) no study on (i) #quantile heads h, (ii) alternative change-point or range-based filters, (iii) image s

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper proposes a new vision-model-based time series foundation model, which achieves good forecasting performance across multiple benchmarks. 2. The paper clearly identifies the key challenges of applying vision models to time series analysis, including the data–modality gap and the multivariate–forecasting gap. 3. The overall writing of the paper is clear and well-organized.

Weaknesses

1. The paper presents an incremental improvement over VisionTS, with the proposed modules—vision-model-based filtering, colorized multivariate conversion, and multi-quantile forecasting—being relatively straightforward. The filtering module performs simple threshold-based filtering; the multivariate conversion resembles prior vision-based time series models such as ViTST (NeurIPS 2023); and the multi-quantile forecasting capability has already been incorporated in most recent TSFMs. 2. Regarding
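For readers unfamiliar with the multi-quantile forecasting mentioned in this review, the sketch below shows the standard pinball (quantile) loss that is commonly used to train a model to output several quantiles at once. It is a generic illustration and is not taken from the paper; the function name and the array layout (one row of predictions per quantile level) are assumptions made for this example.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantiles):
    """Mean pinball loss over all quantile levels and time steps.

    y_true:    shape (T,)    observed values
    y_pred:    shape (Q, T)  one predicted series per quantile level (assumed layout)
    quantiles: shape (Q,)    levels in (0, 1), e.g. [0.1, 0.5, 0.9]
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    q = np.asarray(quantiles, dtype=float)[:, None]   # (Q, 1) for broadcasting
    diff = y_true[None, :] - y_pred                   # positive when under-predicting
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))
```

With a level of 0.9, under-predicting by one unit costs 0.9 while over-predicting by one unit costs 0.1, which pushes that output towards the upper tail of the predictive distribution.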

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper clearly identifies major limitations when applying vision models to time-series forecasting and systematically attempts to address them. 2. Incorporating probabilistic forecasting into the framework is novel and refreshing, extending beyond conventional deterministic designs.

Weaknesses

1. The study only evaluates MAE-based VisionTS++, without validating other vision backbones (SimMIM, BootMAE, etc.). This limits the generality of the proposed framework. 2. The paper does not discuss the computational cost of continual pre-training, raising concerns about training efficiency and scalability.

Code & Models

Models

🤗 Lefei/VisionTSpp · model · ♡ 1

Taxonomy

Topics: Time Series Analysis and Forecasting · Forecasting Techniques and Applications · Machine Learning in Healthcare