Hallucination Detection and Mitigation with Diffusion in Multi-Variate Time-Series Foundation Models
Vijja Wichitwechkarn, Charles Fox, Ruchi Choudhary

TL;DR
This paper introduces new definitions and methods for detecting and reducing hallucinations in multi-variate time-series foundation models using diffusion techniques, aiming to improve their reliability and safety.
Contribution
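The paper proposes definitions of hallucination for MVTS foundation models and develops diffusion-based methods to detect and mitigate it. According to the authors, comparable definitions and methods exist for natural language foundation models but not for MVTS models.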
Findings
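Open-source pre-trained MVTS imputation foundation models relationally hallucinate, on average, up to 59.5% as much as a weak baseline.
The proposed mitigation method reduces this relational hallucination by up to 47.7% for these models.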
Benchmark datasets for relational hallucination levels are introduced.
Abstract
Foundation models for natural language processing have many coherent definitions of hallucination and methods for its detection and mitigation. However, analogous definitions and methods do not exist for multi-variate time-series (MVTS) foundation models. We propose new definitions for MVTS hallucination, along with new detection and mitigation methods using a diffusion model to estimate hallucination levels. We derive relational datasets from popular time-series datasets to benchmark these relational hallucination levels. Using these definitions and models, we find that open-source pre-trained MVTS imputation foundation models relationally hallucinate on average up to 59.5% as much as a weak baseline. The proposed mitigation method reduces this by up to 47.7% for these models. The definition and methods may improve adoption and safe usage of MVTS foundation models.
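To make the idea of a relational dataset concrete, here is a minimal sketch. It assumes, as the reviews below describe, that a channel computed by a known function is appended to an existing dataset, so that a relational error can be measured against that function. The names `make_relational`, `relational_error`, and `channel_sum`, the choice of a sum as the relation, and the random stand-in data are all illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def make_relational(base, relation):
    """Append one channel computed from the existing channels by a known function."""
    derived = relation(base)                      # shape (T,)
    return np.column_stack([base, derived])       # shape (T, C + 1)

def relational_error(sample, relation):
    """Mean absolute gap between the last channel and relation(other channels)."""
    base, derived = sample[:, :-1], sample[:, -1]
    return float(np.mean(np.abs(derived - relation(base))))

def channel_sum(x):
    """Hypothetical relation: the appended channel is the sum of the base channels."""
    return x.sum(axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))                  # stand-in for a real MVTS window
data = make_relational(base, channel_sum)

# A sample that respects the relation has (numerically) zero relational error.
consistent = relational_error(data, channel_sum)

# An output that breaks the relation, e.g. an imputation that shifts the derived channel.
corrupted = data.copy()
corrupted[:, -1] += 0.5
violating = relational_error(corrupted, channel_sum)
```

Because the relation is known by construction, any model output can be scored directly against it, which is what makes such datasets usable as a benchmark for a learned hallucination estimate.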
Peer Reviews
Decision · Submitted to ICLR 2026
This paper provides the first formalization of hallucination in the time-series domain, making an important conceptual and methodological contribution. The definitions of distributional and relational hallucination are clearly motivated by NLP analogies yet appropriately adapted for MVTS. The proposed Combined Error (CE) metric is elegant and computationally efficient, as it reuses a diffusion model’s denoising dynamics to quantify internal consistency without requiring external labels or superv
Despite its novelty, the paper has several weaknesses limiting its theoretical depth and empirical generalizability. The diffusion-based CE metric is heuristically motivated and lacks theoretical grounding linking it to true relational error; no formal proof connects CE to hallucination likelihood beyond empirical correlation. Furthermore, the evaluation setting is limited—all relational datasets are derived from existing benchmarks by appending simple transformations, so the results may not gen
(I have little experience with Time-Series Foundation Models, so my confidence score is low. I kindly ask the AC to assign a lower weight to my review.) The paper provides a valuable conceptual contribution by defining "hallucination" in the MVTS context, which is critical for model reliability in scientific applications. The creation of "relational datasets" with known ground-truth functions allows for a robust, quantitative validation of the proposed CE metric against the true relational e
The method's main drawback is its reliance on a dataset-specific diffusion model. This verifier must be trained for each target dataset, which increases computational overhead and undermines the zero-shot/few-shot promise of FMs. The proposed mitigation strategy (sampling N=20 times and filtering) is a costly, brute-force approach.
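The sample-and-filter mitigation this review describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `mitigate`, `toy_impute`, and `toy_score` are hypothetical names, the filtering rule (keep the lowest-scoring candidates) is assumed, and the score here is a toy stand-in for the paper's diffusion-based hallucination estimate.

```python
import numpy as np

def mitigate(impute, score, n_samples=20, keep=1, rng=None):
    """Draw n_samples candidate imputations and keep the `keep` lowest-scoring ones."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = [impute(rng) for _ in range(n_samples)]
    scores = np.array([score(c) for c in candidates])
    order = np.argsort(scores)[:keep]             # indices of the best candidates
    return [candidates[i] for i in order], scores[order]

def toy_impute(rng):
    """Hypothetical stochastic imputer: channel 1 should equal 2 * channel 0."""
    x = rng.normal(size=50)
    noise = rng.normal(scale=rng.uniform(0.0, 1.0), size=50)
    return np.column_stack([x, 2.0 * x + noise])

def toy_score(sample):
    """Stand-in for a learned hallucination estimate (assumption: lower is better)."""
    return float(np.mean(np.abs(sample[:, 1] - 2.0 * sample[:, 0])))

kept, kept_scores = mitigate(
    toy_impute, toy_score, n_samples=20, keep=1, rng=np.random.default_rng(0)
)
```

The cost concern raised in the review is visible in the sketch: every call to `mitigate` runs the imputer and the scorer `n_samples` times, so inference cost grows linearly with the number of candidates.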
1. This work is the first to systematically explore the hallucination problem in MVTS models from the perspective of generative modeling. The research question is novel and holds strong potential for future studies in the time-series domain.
2. The proposed CE leverages diffusion–reverse diffusion discrepancies to measure latent distributional consistency, offering a degree of causal interpretability.
3. The combination of detection and mitigation modules forms a closed-loop framework, demonstra
1. The paper lacks a theoretical explanation of the mathematical interpretability of the CE metric, as it does not derive its relationship with hallucination intensity from diffusion probability theory or an energy-based perspective.
2. No significance testing is provided, leaving the statistical reliability of the reported results unclear.
3. The adopted MLP-DDPM architecture lacks the capacity to model temporal dependencies.
4. The proposed method relies heavily on the integrity of the trainin
