Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment
Shunyu Wu, Dan Li, Wenjie Feng, Haozheng Ye, Jian Lou, See-Kiong Ng

TL;DR
This paper introduces TSRating, a meta-learning framework leveraging LLM judgments to accurately and efficiently rate the quality of diverse time series data across multiple domains, improving over existing methods.
Contribution
It proposes a unified, cross-domain time series data quality rating framework using LLMs and meta-learning, enhancing adaptability and efficiency over prior domain-specific approaches.
Findings
TSRating outperforms baselines in accuracy and efficiency.
The framework demonstrates strong cross-domain adaptability.
Experimental results on multiple datasets validate effectiveness.
Abstract
High-quality time series (TS) data are essential for ensuring TS model performance, rendering research on rating TS data quality indispensable. Existing methods have shown promising rating accuracy within individual domains, primarily by extending data quality rating techniques such as influence functions and Shapley values to account for temporal characteristics. However, they neglect the fact that real-world TS data can span vastly different domains and exhibit distinct properties, hampering the accurate and efficient rating of diverse TS data. In this paper, we propose TSRating, a novel and unified framework for rating the quality of time series data crawled from diverse domains. TSRating leverages LLMs' inherent ample knowledge, acquired during their extensive pretraining, to comprehend and discern quality differences in diverse TS data. We verify this by devising a series of…
Peer Reviews
Decision·ICLR 2026 Poster
**S1.** This paper is well-organized and easy to follow. **S2.** The task of evaluating time series dataset quality is important in deep time series model training. **S3.** Meta-learning across 9 domains/22 subsets and reuse on unseen datasets is a sensible design for diverse time series.
**W1.** The most critical weakness lies in the evaluation strategy of TSRating. The framework implicitly assumes that time series with strong trends, seasonality, and regular patterns (Figure 4 in the Appendix) represent high-quality data because they are more predictable and easier to learn. Conversely, series with irregular or unpredictable fluctuations (e.g., the red block in the right part of Figure 1) are treated as “bad samples.” However, in real-world applications, data often exhibit irre
1. Proposes an innovative method for time series quality evaluation by employing LLMs as general “pattern evaluators”, moving beyond traditional statistical or contribution-based metrics. 2. Uses meta-learning to effectively address domain heterogeneity, resulting in strong cross-domain generalization and applicability. 3. Clearly defines the evaluation criteria and provides transparent methodology. 4. Demonstrates significant practical relevance through measurable improvements in downstream tas
1. Heavy reliance on LLM judgments may introduce biases or inaccuracies in understanding intrinsic time series patterns, risking propagation of these biases into TSRater. 2. The definition of “quality” along four fixed dimensions implicitly equates high quality with signal clarity or prominence, which may not be optimal for all downstream tasks (e.g., classification vs. forecasting), potentially leading to suboptimal data selection.
1. The paper exhibits strong originality through its synthesis of multiple concepts: applying the LLM-as-a-judge paradigm cross-modally to raw time series signals; formulating a unique knowledge distillation pipeline to train an efficient scoring model from LLM preferences; and being the first to apply meta-learning for learning a cross-domain data quality scoring function. This architectural combination is novel. 2. The TSRating framework is well-designed. The use of pairwise comparisons, Bradle
1. The framework relies on LLM judgments, which may have inherent biases or sensitivity to prompts. While stability measures were taken, potential systematic biases require further discussion. 2. Justification for selecting the four specific criteria (trend, frequency, amplitude, pattern) could be stronger. Their sufficiency for all TS tasks (e.g., anomaly detection) is unclear, and potential subjectivity exists. 3. Collecting numerous LLM judgments across multiple domains for meta-learning mig
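The reviews above note that TSRating builds on pairwise comparisons judged by an LLM. As an illustration only, the sketch below shows one standard way to turn pairwise preferences into per-sample scalar scores: fitting a Bradley-Terry model with the classic minorize-maximize update. This is an assumption about the general technique, not the authors' implementation; the function name `bradley_terry_scores`, its parameters, and the toy comparison list are hypothetical.

```python
import math


def bradley_terry_scores(n_items, comparisons, n_iters=200):
    """Fit Bradley-Terry log-strengths from (winner, loser) index pairs.

    Hypothetical helper for illustration; uses the standard
    minorize-maximize (Zermelo) update, not any method from the paper.
    """
    wins = [0.0] * n_items
    pair_counts = {}  # (i, j) with i < j -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strengths = [1.0] * n_items
    for _ in range(n_iters):
        denom = [0.0] * n_items
        for (i, j), count in pair_counts.items():
            shared = count / (strengths[i] + strengths[j])
            denom[i] += shared
            denom[j] += shared
        # Floor at a tiny value so items with no wins keep a finite score.
        updated = [
            max(wins[i] / denom[i], 1e-9) if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(updated)
        strengths = [s * n_items / total for s in updated]

    return [math.log(s) for s in strengths]


# Toy data (made up): each pair means "first sample judged better than second".
toy_comparisons = [
    (0, 1), (0, 1), (1, 0),
    (1, 2), (1, 2), (2, 1),
    (0, 2), (0, 2), (0, 2),
]
scores = bradley_terry_scores(3, toy_comparisons)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy setup, sample 0 wins most of its comparisons and sample 2 the fewest, so the fitted scores rank them in that order. In a pipeline like the one the reviews describe, scores of this kind could serve as training targets for a cheaper rating model, though the exact procedure is not given in the text here.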
Taxonomy
Topics: Time Series Analysis and Forecasting
Methods: Spatio-temporal stability analysis
