How to Select Datapoints for Efficient Human Evaluation of NLG Models?
Vilém Zouhar, Peng Cui, Mrinmaya Sachan

TL;DR
This paper proposes and evaluates methods for selecting the most informative data points for human evaluation of NLG models, reducing costs while maintaining evaluation accuracy.
Contribution
It introduces novel selectors based on variance, diversity, and Item Response Theory, including source-based estimators when outputs are unavailable.
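As an illustration of the variance-based idea, the sketch below ranks test items by how much an automated metric's scores differ across models and keeps the top items within a budget. It is a minimal sketch, not the authors' implementation: the function name `select_by_metric_variance`, the `budget` parameter, and the toy scores are assumptions made for this example, and per-item annotation cost is ignored.

```python
from statistics import pvariance


def select_by_metric_variance(metric_scores, budget):
    """Return indices of the `budget` items with the highest score variance.

    metric_scores[i][j] is the automated metric score of model j on item i.
    (Hypothetical layout chosen for this sketch.)
    """
    # Variance across models for each item: high variance means the metric
    # separates the models on this item, so a human label is likely informative.
    variances = [pvariance(item_scores) for item_scores in metric_scores]
    # Rank item indices from highest to lowest variance.
    ranked = sorted(range(len(metric_scores)), key=lambda i: variances[i], reverse=True)
    return ranked[:budget]


# Toy data (made up): three items scored for three models.
scores = [
    [0.80, 0.81, 0.79],  # item 0: models nearly tied
    [0.20, 0.90, 0.55],  # item 1: models disagree strongly
    [0.60, 0.40, 0.50],  # item 2: moderate disagreement
]
selected = select_by_metric_variance(scores, budget=2)
print(selected)
```

Items on which all models score alike carry little information for ranking the models, which is why this sketch sends the high-variance items to human annotators first.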
Findings
The proposed selectors are more efficient than random sampling.
With these selectors, approximately 70% of the data suffices for an accurate evaluation.
The selectors are effective on both machine translation and summarization tasks.
Abstract
Human evaluation is the gold standard for evaluating text generation models. However, it is expensive. In order to fit budgetary constraints, a random subset of the test data is often chosen in practice for human evaluation. However, randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop and analyze a suite of selectors to get the most informative datapoints for human evaluation, taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for…
Taxonomy
Topics: Human-Automation Interaction and Safety · Context-Aware Activity Recognition Systems · Intelligent Tutoring Systems and Adaptive Learning
