LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar

TL;DR
This paper explores using large language models as both generators and evaluators of time series explanations, proposing a reference-free assessment method and constructing a synthetic benchmark for evaluation.
Contribution
It introduces a novel LLM-based evaluation framework for time series explanations and creates a synthetic benchmark dataset for testing explanation quality.
Findings
Evaluation models reliably rank explanations despite generation failures.
Generation accuracy varies significantly across query types.
Models perform well in explanation ranking even when their own explanations are incorrect.
Abstract
Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting: given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on…
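The reference-free evaluator setting described in the abstract (time series, question, and candidate explanation in, ternary label out) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the label names, the prompt wording, and the helpers `JudgeInput`, `build_prompt`, `parse_label`, and `judge` are all hypothetical, and the language model is abstracted as any callable that maps a prompt string to a reply string.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical label names: the abstract only states that the label is ternary.
LABELS = ("correct", "partially_correct", "incorrect")


@dataclass
class JudgeInput:
    series: Sequence[float]  # raw time series values
    question: str            # query about the series
    explanation: str         # candidate explanation to be judged


def build_prompt(item: JudgeInput) -> str:
    """Serialize the series, question, and explanation into one prompt.

    No reference explanation is included, which is what makes the
    setting reference-free. The wording is an assumption.
    """
    values = ", ".join(f"{v:.3g}" for v in item.series)
    return (
        "Check the explanation against the time series.\n"
        f"Series: [{values}]\n"
        f"Question: {item.question}\n"
        f"Candidate explanation: {item.explanation}\n"
        f"Answer with exactly one label from: {', '.join(LABELS)}."
    )


def parse_label(reply: str) -> str:
    """Normalize the model reply and require an exact label match."""
    text = reply.strip().strip(".").lower()
    text = text.replace(" ", "_").replace("-", "_")
    if text in LABELS:
        return text
    raise ValueError(f"reply is not one of {LABELS}: {reply!r}")


def judge(item: JudgeInput, llm: Callable[[str], str]) -> str:
    """Run one evaluation: prompt the model, then parse its label."""
    return parse_label(llm(build_prompt(item)))
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; a stub such as `lambda prompt: "correct"` is enough to exercise the parsing logic.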
