Overcoming the Modality Gap in Context-Aided Forecasting
Vincent Zhihao Zheng, Étienne Marcotte, Arjun Ashok, Andrew Robert Williams, Lijun Sun, Alexandre Drouin, Valentina Zantedeschi

TL;DR
This paper introduces a semi-synthetic data augmentation method for context-aided forecasting and uses it to build a large dataset, showing that dataset quality, rather than model architecture, is what limits performance.
Contribution
The authors develop a semi-synthetic data generation approach for context-aided forecasting (CAF), use it to build a large-scale dataset, and show that dataset quality is key to effective context utilization.
Findings
Semi-synthetic pre-training transfers well to real-world data.
The large CAF-7M dataset enables better context utilization.
Dataset quality, not architecture, limits CAF performance.
Abstract
Context-aided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, as verification is challenging. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world…
