LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
Lukáš Eigler, Jindřich Libovický, David Hurych

TL;DR
This paper introduces a scalable method using large language models as meta-judges to generate synthetic evaluation datasets, effectively replacing costly human annotations for validating NLP metrics across multiple tasks and languages.
Contribution
It presents a novel framework leveraging LLMs to create synthetic datasets for metric validation, reducing reliance on human judgments and enabling scalable, multilingual evaluation.
Findings
Metric rankings derived from synthetic datasets align closely with rankings from human benchmarks, with meta-correlations exceeding 0.9 in multilingual QA.
The approach is effective across Machine Translation (MT), Question Answering (QA), and Summarization.
Synthetic validation is a viable, scalable alternative to traditional human-based evaluation.
Abstract
Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, which measures the alignment between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become…
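The meta-correlation idea in the abstract can be illustrated with a minimal sketch. It assumes that each metric has already been given one quality score on the synthetic data and one on a human benchmark, and that the two resulting rankings are compared with Spearman's rank correlation. The choice of Spearman, the function names, and all numbers below are illustrative assumptions, not the authors' implementation or results.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def meta_correlation(synthetic_quality, human_quality):
    """Compare how two validation sources rank the same set of metrics.

    Both arguments map a metric name to a quality score (hypothetical
    inputs); the two dictionaries must share the same keys.
    """
    names = sorted(synthetic_quality)
    return spearman(
        [synthetic_quality[n] for n in names],
        [human_quality[n] for n in names],
    )


# Made-up scores for four hypothetical metrics, for illustration only.
synthetic = {"metric_a": 0.62, "metric_b": 0.48, "metric_c": 0.75, "metric_d": 0.30}
human = {"metric_a": 0.55, "metric_b": 0.51, "metric_c": 0.70, "metric_d": 0.22}

score = meta_correlation(synthetic, human)
print(f"meta-correlation: {score:.2f}")
```

In this toy example both sources order the four metrics identically, so the score is 1.0. A value near 1 would indicate that synthetic validation picks the same metrics that human validation would.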
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
