Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang

TL;DR
Time-VLM is a multimodal framework that uses pre-trained vision-language models to improve time series forecasting by combining temporal, visual, and textual information. It is especially effective in few-shot and zero-shot scenarios.
Contribution
The paper presents a novel multimodal approach that combines pre-trained vision-language models with time series data for improved forecasting accuracy.
Findings
Outperforms existing models in few-shot scenarios
Effective in zero-shot time series forecasting
Demonstrates the benefit of multimodal fusion for temporal data
Abstract
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These…
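Illustrative sketch
The abstract does not say how the Vision-Augmented Learner builds its images or how the Text-Augmented Learner words its descriptions, so the Python sketch below is only a rough illustration of the general idea, not the authors' method. It assumes a simple period-folding image encoding and a template-based summary. The function names `series_to_image` and `describe_series` and the `period` parameter are hypothetical. The Retrieval-Augmented Learner is not sketched.

```python
import numpy as np


def series_to_image(series, period):
    """Fold a 1-D series into a 2-D grayscale image.

    Assumption (not from the paper): each row holds one period of the
    series, and values are min-max scaled to the range [0, 255].
    """
    x = np.asarray(series, dtype=float)
    n_rows = len(x) // period
    if n_rows == 0:
        raise ValueError("series is shorter than one period")
    # Drop the oldest points so the length is a multiple of the period.
    x = x[len(x) - n_rows * period:]
    grid = x.reshape(n_rows, period)
    lo, hi = grid.min(), grid.max()
    scale = hi - lo if hi > lo else 1.0  # avoid division by zero on flat input
    return np.round((grid - lo) / scale * 255).astype(np.uint8)


def describe_series(series):
    """Build a short template-based description of a series.

    Assumption (not from the paper): the trend is judged by comparing
    the mean of the second half with the mean of the first half.
    """
    x = np.asarray(series, dtype=float)
    trend = "flat"
    if len(x) >= 2:
        half = len(x) // 2
        delta = x[half:].mean() - x[:half].mean()
        if delta > 0:
            trend = "upward"
        elif delta < 0:
            trend = "downward"
    return (
        f"The series has {len(x)} points, min {x.min():.2f}, "
        f"max {x.max():.2f}, mean {x.mean():.2f}, "
        f"and an overall {trend} trend."
    )
```

In a pipeline of this kind, the image and the description would then go to the vision and text encoders of a pre-trained VLM. That step is omitted here because the abstract excerpt does not specify it.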
Taxonomy
Topics: Time Series Analysis and Forecasting · Advanced Computational Techniques and Applications · Stock Market Forecasting Methods
