How to Select Datapoints for Efficient Human Evaluation of NLG Models?

Vilém Zouhar, Peng Cui, Mrinmaya Sachan

arXiv:2501.18251·cs.CL·June 3, 2025

TL;DR

This paper proposes and evaluates methods for selecting the most informative data points for human evaluation of NLG models, reducing costs while maintaining evaluation accuracy.

Contribution

It introduces novel selectors based on variance, diversity, and Item Response Theory, including source-based estimators when outputs are unavailable.

Findings

01 Selectors outperform random sampling in efficiency.

02 Approximately 70% of data suffices for accurate evaluation.

03 Effective in machine translation and summarization tasks.

Abstract

Human evaluation is the gold standard for evaluating text generation models. However, it is expensive. In order to fit budgetary constraints, a random subset of the test data is often chosen in practice for human evaluation. However, randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop and analyze a suite of selectors to get the most informative datapoints for human evaluation, taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for…
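
To make the idea of a variance-based selector concrete, here is a minimal sketch. It is not the authors' implementation and does not use the subset2evaluate API; the function name, the data layout (one list of automated metric scores per item, one score per model), and the toy numbers are all assumptions made for illustration. It also ignores per-item evaluation cost, which the abstract says the paper takes into account.

```python
import statistics


def select_by_metric_variance(metric_scores, budget):
    """Return indices of the `budget` items whose automated metric
    scores vary most across models (hypothetical helper).

    metric_scores[i][m] is the metric score of model m on item i.
    """
    # Population variance of each item's scores across models.
    variances = [statistics.pvariance(scores) for scores in metric_scores]
    # Rank items from highest to lowest variance; ties keep input order.
    ranked = sorted(
        range(len(metric_scores)),
        key=lambda i: variances[i],
        reverse=True,
    )
    return ranked[:budget]


# Toy data (made up): 4 items scored for 3 models.
metric_scores = [
    [0.80, 0.81, 0.79],  # models nearly tie: little to learn from humans
    [0.20, 0.90, 0.55],  # models disagree strongly
    [0.60, 0.40, 0.50],  # moderate disagreement
    [0.95, 0.95, 0.95],  # identical scores: zero variance
]

selected = select_by_metric_variance(metric_scores, budget=2)
print(selected)
```

The intuition in this sketch is that items on which models receive near-identical metric scores are unlikely to change a model ranking, so the annotation budget goes to items where the scores spread out.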

Code & Models

Repositories

zouharvi/subset2evaluate (Official)

Taxonomy

Topics: Human-Automation Interaction and Safety · Context-Aware Activity Recognition Systems · Intelligent Tutoring Systems and Adaptive Learning