LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

Lukáš Eigler; Jindřich Libovický; David Hurych

arXiv:2603.09403·cs.CL·March 11, 2026

TL;DR

This paper introduces a scalable method using large language models as meta-judges to generate synthetic evaluation datasets, effectively replacing costly human annotations for validating NLP metrics across multiple tasks and languages.

Contribution

It presents a novel framework leveraging LLMs to create synthetic datasets for metric validation, reducing reliance on human judgments and enabling scalable, multilingual evaluation.

Findings

01

Metric rankings derived from synthetic datasets align closely with those derived from human benchmarks, with meta-correlations exceeding 0.9 in multilingual QA.

02

The approach is effective across Machine Translation (MT), Question Answering (QA), and Summarization.

03

Synthetic validation is a viable, scalable alternative to traditional human-based evaluation.

Abstract

Validating evaluation metrics for natural language generation (NLG) typically relies on expensive and time-consuming human annotations, which exist predominantly for English datasets. We propose LLM as a Meta-Judge, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, which measures the alignment between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become…
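The meta-correlation idea in the abstract can be illustrated with a small sketch. The abstract does not say which correlation coefficient the authors use, so Kendall's tau is an assumption here, and the metric names and scores below are hypothetical placeholders rather than results from the paper. Each candidate metric gets one quality score from a human-annotated benchmark and one from the synthetic dataset; the meta-correlation is then the rank agreement between the two sets of scores.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists (tied pairs count as neither)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def meta_correlation(human_based, synthetic_based):
    """Rank agreement between metric scores from human and synthetic validation.

    Both arguments map a metric name to its quality score. Only metrics
    present in both mappings are compared.
    """
    shared = sorted(set(human_based) & set(synthetic_based))
    return kendall_tau(
        [human_based[m] for m in shared],
        [synthetic_based[m] for m in shared],
    )


# Hypothetical scores for four unnamed metrics (not taken from the paper).
human_based = {"metric_a": 0.30, "metric_b": 0.45, "metric_c": 0.60, "metric_d": 0.52}
synthetic_based = {"metric_a": 0.41, "metric_b": 0.58, "metric_c": 0.73, "metric_d": 0.77}

# The two rankings disagree on one of the six metric pairs (metric_c vs metric_d).
tau = meta_correlation(human_based, synthetic_based)
print(round(tau, 3))
```

A value near 1 would mean the synthetic data ranks the metrics almost exactly as the human benchmark does. Spearman or Pearson correlation could be substituted in `meta_correlation` without changing the overall structure.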

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification