Overcoming the Modality Gap in Context-Aided Forecasting

Vincent Zhihao Zheng; Étienne Marcotte; Arjun Ashok; Andrew Robert Williams; Lijun Sun; Alexandre Drouin; Valentina Zantedeschi

arXiv:2603.12451·cs.LG·April 23, 2026

TL;DR

This paper introduces a semi-synthetic data augmentation method for context-aided forecasting (CAF) and uses it to build a large dataset, CAF-7M. The results indicate that dataset quality, rather than model architecture, is what limits performance.

Contribution

The authors develop a semi-synthetic data generation approach for CAF, resulting in a large-scale dataset and showing that dataset quality is key to effective context utilization.

Findings

01 Semi-synthetic pre-training transfers well to real-world data.

02 The large CAF-7M dataset enables better context utilization.

03 Dataset quality, not architecture, limits CAF performance.

Abstract

Context-aided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, as verification is challenging. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world…
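
The gap described in the abstract, where a multimodal model fails to outperform its unimodal counterpart, can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the function name `context_gain`, the choice of mean absolute error as the metric, and the toy numbers are all assumptions.

```python
from statistics import mean


def mae(forecast, target):
    """Mean absolute error between two equal-length sequences."""
    if len(forecast) != len(target):
        raise ValueError("forecast and target must have the same length")
    return mean(abs(f - t) for f, t in zip(forecast, target))


def context_gain(target, history_only_forecast, context_aided_forecast):
    """Relative MAE reduction from adding context (hypothetical metric).

    Positive: the context helped. Zero or negative: the context-aided
    forecast is no better than the history-only one.
    """
    base = mae(history_only_forecast, target)
    aided = mae(context_aided_forecast, target)
    if base == 0:
        # The history-only forecast is already perfect, so context
        # can only tie it (gain 0) or make it worse.
        return 0.0 if aided == 0 else float("-inf")
    return (base - aided) / base


# Toy window (made-up numbers): the context announces a level shift
# starting at the third step, which the history alone cannot reveal.
target = [10.0, 10.0, 20.0, 20.0]
history_only = [10.0, 10.0, 10.0, 10.0]
context_aided = [10.0, 10.0, 18.0, 20.0]

gain = context_gain(target, history_only, context_aided)
print(f"context gain: {gain:.2f}")
```

A gain near zero across a dataset would be consistent with either a model that ignores its context or contexts that carry no information beyond the numerical history; the abstract attributes the observed gap to the latter.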

Code & Models

Models

ServiceNow/DoubleCast
model· 10 dl· ♡ 1

Datasets

ServiceNow/CAF_7M
dataset· 163 dl
