When Does Multimodality Lead to Better Time Series Forecasting?

Xiyuan Zhang; Boran Han; Haoyang Fang; Abdul Fatir Ansari; Shuai Zhang; Danielle C. Maddix; Cuixiong Hu; Andrew Gordon Wilson; Michael W. Mahoney; Hao Wang; Yan Liu; Huzefa Rangwala; George Karypis; Bernie Wang

arXiv:2506.21611·cs.CL·October 1, 2025

TL;DR

This study systematically evaluates when incorporating textual data improves time series forecasting, revealing that benefits depend on model capacity, data characteristics, and alignment strategies, and are not universally guaranteed.

Contribution

The paper provides a comprehensive analysis of conditions under which multimodal (text and time series) models improve forecasting, offering data-agnostic insights across diverse domains.

Findings

01. Multimodal benefits depend on model capacity and data quality.

02. Alignment strategies influence the effectiveness of multimodal models.

03. Gains are more likely with sufficient data and complementary text signals.

Abstract

Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 16 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Our findings reveal that the benefits of multimodality are highly condition-dependent. While we confirm reported gains in some settings, these improvements are not universal across datasets or models. To move beyond empirical observations, we disentangle the effects…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

Quality: The experimental design is thorough, covering 16 datasets, multiple model families (e.g., Chronos, BERT, LLMs), and diverse alignment strategies. The synthetic data approach is particularly strong for isolating key variables.

Clarity: The writing is accessible, with clear explanations of methods and results. Visualizations (e.g., scatter plots showing performance trends) effectively communicate complex findings.

Significance: The paper provides actionable guidelines for researchers an

Weaknesses

The study is limited to text and time series; excluding other modalities (e.g., images in retail forecasting) may reduce generalizability to broader multimodal settings. While datasets are diverse, they may not capture all real-world challenges (e.g., ultra-long sequences or low-resource domains). Including more extreme cases could strengthen the conclusions. The evaluation of prompting-based methods relies on current LLMs (e.g., GPT-4, Claude), which evolve rapidly; however, this is mitigated

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The research is underpinned by a clearly defined and compelling motivation.
2. The work itself is pioneering and addresses a problem of considerable importance.
3. The article is exceptionally well-written, and the experimental section is systematically conducted, with results presented in a clear and convincing manner.

Weaknesses

Please refer to the **Questions**.

Reviewer 03·Rating 2·Confidence 4

Strengths

- The authors evaluate multiple existing methods using two pipelines (alignment-based and prompt-based) across 16 diverse datasets.
- The paper provides a comprehensive analysis of different modeling strategies and offers detailed experimental results and insights.

Weaknesses

### 1. **The Definition of “Multimodal” Time Series**

I am very concerned about the formulation of **multimodal** time series in this paper. In Lines 034–044, the authors categorize six “MMTS” methods into two types: (I) alignment-based and (II) prompt-based. However, I believe that **multimodal learning inherently implies semantic alignment** between modalities. For instance, I think it difficult to perceive any semantic alignment between numerical time series data and a textual statement s
