From Images to Signals: Are Large Vision Models Useful for Time Series Analysis?

Ziming Zhao; ChengAo Shen; Hanghang Tong; Dongjin Song; Zhigang Deng; Qingsong Wen; Jingchao Ni

arXiv:2505.24030·cs.LG·July 11, 2025

Open Access · 3 Reviews

TL;DR

This study evaluates the effectiveness of Large Vision Models in time series analysis, finding they excel in classification but face limitations in forecasting, guiding future multimodal research.

Contribution

First comprehensive study assessing LVMs for time series, covering multiple models, datasets, and tasks, with detailed analysis of their strengths and limitations.

Findings

01. LVMs are effective for time series classification.

02. LVMs face challenges in forecasting accuracy.

03. Current LVM forecasters are limited to specific types and exhibit biases.

Abstract

Transformer-based models have gained increasing attention in time series research, driving interest in Large Language Models (LLMs) and foundation models for time series analysis. As the field moves toward multi-modality, Large Vision Models (LVMs) are emerging as a promising direction. In the past, the effectiveness of Transformers and LLMs in time series has been debated. When it comes to LVMs, a similar question arises: are LVMs truly useful for time series analysis? To address it, we design and conduct the first principled study involving 4 LVMs, 8 imaging methods, 18 datasets and 26 baselines across both high-level (classification) and low-level (forecasting) tasks, with extensive ablation analysis. Our findings indicate LVMs are indeed useful for time series classification but face challenges in forecasting. Although effective, the contemporary best LVM forecasters are limited to…
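The abstract refers to "imaging methods" that turn a time series into an image before an LVM sees it. As an illustration only, the sketch below shows one generic way to do this: folding a univariate series by an assumed period so that each column holds one cycle. It is not the authors' implementation, and the function name `fold_by_period` and its parameters are hypothetical.

```python
def fold_by_period(series, period):
    """Fold a 1D series into a 2D grid with one period per column.

    Hypothetical helper for illustration; not taken from the paper.
    Returns a list of `period` rows, each with len(series) // period values
    scaled to [0, 1].
    """
    if period <= 0:
        raise ValueError("period must be positive")
    n_cols = len(series) // period
    if n_cols == 0:
        raise ValueError("series is shorter than one period")

    # Drop the oldest samples so the length is a multiple of the period.
    trimmed = series[len(series) - n_cols * period:]

    # Min-max normalize so values can be read as pixel intensities.
    lo, hi = min(trimmed), max(trimmed)
    scale = (hi - lo) or 1.0  # avoid division by zero for a constant series
    norm = [(x - lo) / scale for x in trimmed]

    # image[r][c] is the value at phase r within period c.
    return [[norm[c * period + r] for c in range(n_cols)]
            for r in range(period)]


if __name__ == "__main__":
    # A toy series that repeats every 4 steps.
    toy = [0, 1, 2, 3] * 3
    for row in fold_by_period(toy, period=4):
        print(row)
```

For a perfectly periodic input, every row of the resulting grid is constant, which makes the periodic structure visible as horizontal bands. The layout depends entirely on the chosen period, so a misaligned period would scramble those bands.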

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

(1) The paper covers a substantial experimental scope (4 LVMs, 8 imaging methods, 18 datasets, 26 baselines) with well-designed ablation studies (RQ1-RQ10). The breadth of analysis is commendable and provides genuine value to the community. (2) The discovery that pre-trained decoders contribute more than encoders in TSF (RQ8) is genuinely interesting and counterintuitive. The analysis of the period-based imaging bias (RQ9) with formal characterization (Lemma 1) provides actionable insights. (3)

Weaknesses

(1) The paper is purely an empirical benchmark study without methodological contributions. While valuable, such studies typically require either exceptional insights or novel proposed solutions. (2) The paper identifies what fails (encoders, long windows) but provides limited mechanistic understanding of why. (3) The claim that forecasting is "low-level" and requires numerical inference needs deeper investigation beyond decoder architecture. (4) The connection to recent multimodal approaches (m

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. It systematically explores large vision models for time series classification and forecasting, covering diverse models, imaging methods, datasets, and baselines to fill existing research gaps. 2. In-depth mechanism analysis reveals key insights and quantifies temporal pattern capture, providing essential support for future optimizations. 3. It identifies optimal task-specific configurations and targeted fine-tuning strategies, enabling efficient real-world implementation. 4. Rigorous benchmark com

Weaknesses

1. The study mentions that Large Vision Models have numerous limitations in time series forecasting tasks. However, leveraging the feature extraction capability of LVMs for TSF and integrating them with time series through cross-modal fusion may help avoid visual limitations and improve overall performance. The current work lacks an investigation into cross-modal fusion involving visual modalities. 2. The study evaluates 8 imaging methods, and from Table 16, UVH achieves the best performance whi

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. First comprehensive study to jointly analyze LVMs across TSC and TSF using diverse image encodings. 2. Provides a valuable reference for researchers exploring multimodal or vision-inspired time-series models.

Weaknesses

1. Most of the forecasting results rely on UVH/MVH imaging applied to relatively periodic datasets. It’s not clear how well these conclusions hold for more irregular, noisy, or non-periodic signals. 2. The claim that decoders matter more than encoders in forecasting is interesting, but the current evidence is mostly correlational. A simple decoder-swapping or partial-finetuning experiment could make this stronger. 3. The authors note that performance drops for longer look-back windows, potential

Taxonomy

Topics: Time Series Analysis and Forecasting

Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding