Conversational Time Series Foundation Models: Towards Explainable and Effective Forecasting
Defu Cao, Michael Gee, Jinbo Liu, Hengxuan Wang, Wei Yang, Rui Wang, Yan Liu

TL;DR
This paper introduces a novel approach where large language models act as intelligent judges to evaluate, explain, and coordinate ensembles of time series models, improving forecasting accuracy and interpretability.
Contribution
It proposes finetuning LLMs with SHAP-based scores to enable causal interpretation of ensemble weights and uses multi-turn conversations for adaptive forecasting strategy refinement.
Findings
Outperforms existing models on the GIFT-Eval benchmark
Achieves state-of-the-art results on CRPS and MASE metrics
Demonstrates effective interpretability and coordination of model ensembles
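For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how MASE and a sample-based CRPS estimate are commonly computed. The function names (`mase`, `crps_from_samples`) and the toy numbers are illustrative assumptions, not taken from the paper, and GIFT-Eval's own evaluation protocol may differ in detail (for example in how CRPS is approximated or aggregated).

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of a seasonal-naive forecast with lag `season`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # In-sample error of the naive forecast y[t] = y[t - season]
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation y:
    E|X - y| - 0.5 * E|X - X'|, with X, X' drawn from the forecast samples."""
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - y))
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term_obs - term_spread)

# Toy values (hypothetical): training series with unit steps, two-step forecast.
mase_value = mase(y_true=[6, 7], y_pred=[5, 9], y_train=[1, 2, 3, 4, 5])
crps_value = crps_from_samples(samples=[0.0, 2.0], y=1.0)
```

Lower is better for both metrics. A MASE below 1 means the forecast beats the naive baseline on average, and the CRPS estimate reduces to the absolute error when all samples coincide.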
Abstract
The proliferation of time series foundation models has created a landscape where no single method achieves consistent superiority, framing the central challenge not as finding the best model, but as orchestrating an optimal ensemble with interpretability. While Large Language Models (LLMs) offer powerful reasoning capabilities, their direct application to time series forecasting has proven ineffective. We address this gap by repositioning the LLM as an intelligent judge that evaluates, explains, and strategically coordinates an ensemble of foundation models. To overcome the LLM's inherent lack of domain-specific knowledge on time series, we introduce an R1-style finetuning process, guided by SHAP-based faithfulness scores, which teaches the model to interpret ensemble weights as meaningful causal statements about temporal dynamics. The trained agent then engages in iterative, multi-turn…
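To make the notion of "ensemble weights" concrete, here is a minimal sketch of a weighted forecast ensemble, together with a simple non-LLM heuristic that sets weights from validation error. This is not the paper's method: in the paper an LLM agent reasons about and adjusts the weights, whereas `combine_forecasts` and `inverse_error_weights` are hypothetical helper names used purely for illustration.

```python
import numpy as np

def combine_forecasts(forecasts, weights):
    """Convex combination of model forecasts.

    forecasts: array of shape (n_models, horizon)
    weights:   non-negative array of shape (n_models,); normalized to sum to 1
    """
    forecasts = np.asarray(forecasts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    weights = weights / weights.sum()
    return weights @ forecasts

def inverse_error_weights(val_forecasts, y_val, eps=1e-8):
    """Heuristic baseline (an assumption, not from the paper): weight each
    model proportionally to 1 / MAE on a held-out validation window."""
    val_forecasts = np.asarray(val_forecasts, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    mae = np.mean(np.abs(val_forecasts - y_val), axis=1)
    inv = 1.0 / (mae + eps)  # eps avoids division by zero for a perfect model
    return inv / inv.sum()

# Toy example with two hypothetical models and a two-step horizon.
ensemble = combine_forecasts([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 3.0])
equal_w = inverse_error_weights([[1.0, 1.0], [3.0, 3.0]], y_val=[2.0, 2.0])
```

A fixed rule like this is the kind of non-LLM weighting that an LLM-driven scheme can be compared against to isolate what the reasoning step contributes.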
Peer Reviews
Decision·Submitted to ICLR 2026
Agentic solutions for time series forecasting are a novel and underexplored area, largely absent from current benchmarks, which are heavily populated by foundation or deep learning models. Building the agentic workflow on top of the ensemble backbone is intuitive: it rests on a strong foundation yet leaves the agent enough room to make decisions by adjusting weights. The paper thoroughly explains the structure of the agent, strengthened with equations and visuals wher
The experiments miss a critical ablation isolating the key contribution: the LLM’s control over ensemble weights. The current ablations focus on design choices for the agent itself, but not on how much value the LLM brings compared to a simple ensemble where weights are optimized without LLM intervention. Without this, it’s hard to assess the true benefit of the proposed architecture. Moreover, since the approach fundamentally builds on ensembling, it should be benchmarked against a broader suite
- The idea of positioning an LLM as a meta-optimizer or reasoning controller for existing time-series foundation models is interesting and timely.
- The work clearly identifies the limitations of direct LLM forecasting and attempts to use reasoning for ensemble coordination instead of numerical prediction.
- The introduction of the Temporal Incompatibility Index is conceptually appealing, and could inspire further study on heterogeneity in time-series regimes.
- The paper reports improved emp
- The training process for the LLM agent is underexplained. It remains unclear how the SFT training data are constructed: who wrote or generated the reasoning traces, how ground-truth ensemble weights were determined, and what constitutes a “correct” reasoning trajectory. Without this, reproducibility and credibility of the training pipeline are limited.
- The paper does not convincingly justify why LLM-based reasoning is necessary when all metrics (MAE, MSE, etc.) are already computable and co
- The paper introduces a novel framework that leverages LLMs as orchestrators that coordinate multiple time series foundation models through reasoning-based ensemble optimization.
- The framework is theoretically supported, showing that ensembles outperform a single model.
- Experiments are comprehensive, covering diverse datasets and domains. Ablation studies and sensitivity analyses are thoroughly conducted. Results are strong across various metrics.
- The scalability and latency of multi-turn reasoning and optimization are not analyzed. The computational cost of repeated optimization, SHAP evaluation, etc. may be high. It's unclear whether this framework is practically useful.
- The interpretability claims could be further validated through user studies or quantitative metrics.
- The framework's robustness to noise is not evaluated. It would be helpful to test on noisy datasets such as stock prices or social media traffic, where temporal si
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Forecasting Techniques and Applications · Topic Modeling
