Plots Unlock Time-Series Understanding in Multimodal Models
Mayank Daswani, Mathias M.J. Bellaiche, Marc Wilson, Desislav Ivanov, Mikhail Papkov, Eva Schnider, Jing Tang, Kay Lamerigts, Gabriela Botea, Michael A. Sanchez, Yojan Patel, Shruthi Prabhakara, Shravya Shetty, Umesh Telang

TL;DR
This paper introduces a method that uses existing vision encoders in multimodal models to analyze time-series data through plots, significantly improving performance and reducing costs compared to raw data input.
Contribution
It demonstrates that visual representations of time-series data enable foundation models to better understand complex, real-world tasks without additional training.
Findings
Up to 120% performance increase on synthetic tasks
Up to 150% performance increase on real-world health tasks
90% reduction in model API costs
Abstract
While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to "see" time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean…
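To make the text-versus-plot contrast concrete, here is a minimal sketch that produces both representations of the same series. It is not the authors' pipeline: the function names `series_to_text` and `series_to_png`, the 256x128 canvas, the three-decimal formatting, and the bare grayscale line rendering are all assumptions chosen for illustration, and only the Python standard library is used so the example stays self-contained.

```python
import math
import struct
import zlib


def series_to_text(values, decimals=3):
    """Serialize a series as comma-separated numbers (the raw-text input)."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)


def series_to_png(values, width=256, height=128):
    """Rasterize a series as a minimal grayscale line plot; return PNG bytes."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    # White canvas: one bytearray of pixel intensities per image row.
    rows = [bytearray([255] * width) for _ in range(height)]

    def to_row(v):
        # Map a value to a pixel row; larger values sit nearer the top.
        return (height - 1) - round((v - lo) / span * (height - 1))

    prev = None
    for x in range(width):
        # Pick the sample that falls on this pixel column (plain subsampling;
        # a real plotting library would draw every segment).
        i = round(x * (len(values) - 1) / max(width - 1, 1))
        y = to_row(values[i])
        # Fill the vertical gap to the previous column so the line is connected.
        start, stop = (y, y) if prev is None else (min(prev, y), max(prev, y))
        for r in range(start, stop + 1):
            rows[r][x] = 0
        prev = y

    def chunk(tag, data):
        # PNG chunk layout: length, type, data, CRC over type + data.
        body = tag + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    # Each scanline starts with filter byte 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)  # 8-bit grayscale
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", header)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))


# Hypothetical demo data: the same sine wave at two lengths.
short = [math.sin(2 * math.pi * t / 50) for t in range(100)]
long = [math.sin(2 * math.pi * t / 50) for t in range(10_000)]

print("text chars:", len(series_to_text(short)), "vs", len(series_to_text(long)))
print("png bytes: ", len(series_to_png(short)), "vs", len(series_to_png(long)))
```

In this sketch the text serialization grows linearly with the number of samples, while the plot keeps the same pixel dimensions however long the series is. Whether that translates into lower API cost depends on how a given provider prices image inputs, which is outside what this example shows.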
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper shows an interesting finding on a few synthetic and real-world tasks. 2. The authors conduct rigorous statistical experiments to evaluate differences between textual and visual presentation of time series inputs.
1. **Writing:** I would encourage the authors to spend some more time improving the writing of the manuscript. For example, there are a lot of forward references to the appendix and supplementary material for key information. This not only makes reading the manuscript harder, but also leaves the reader wondering about some basic questions, for example: how is the synthetic time series generated? The captions should be written so they communicate a story, rather than just the mechanics of the plot.
1. The comparison to human interpretation of numerical versus visual input helped to motivate why this enables a new mode of reasoning in foundation models. 2. On some synthetic and real data tasks, visual interpretation of the data does lead to a notable improvement over text interpretation on many pattern-recognition tasks.
1. It seems that the actual task of recognizing the types of visual patterns achieved by this method is not inherently novel: Section 2 shows that models are already able to parse plots and tables. So the contribution of the paper would be clearer if the authors provided (1) comparisons to actual time-series baselines to show an improvement over prior work or (2) further justification on why one would ever feel constrained to using a foundation model to interpret time-series data (i.e., how
- The paper introduces a creative idea of using vision encoders to understand time-series data by visualizing it as plots. - Multimodal models for time-series understanding through visualization potentially reduce the token usage, leading to lower API costs. - The experiments cover both synthetic and real-world datasets, providing an assessment of the proposed method's strengths and weaknesses across different contexts.
- The core idea of using multimodal models to interpret time-series data via visual plots may not be sufficiently novel. - The experimental methodology lacks rigorous justification. The paper does not provide enough comparative analysis with well-established baselines or detailed theoretical grounding to support its hypothesis, which weakens its scientific contribution. - The related work section is not well-organized, making it difficult for readers to understand the context and significance of
Taxonomy
Topics: AI-based Problem Solving and Planning · Semantic Web and Ontologies · Cognitive Science and Education Research
Methods: Linear Layer · Residual Connection · Weight Decay · Cosine Annealing · Dropout · Byte Pair Encoding · Softmax · Attention Dropout · Multi-Head Attention
