The Authenticity Gap in Human Evaluation
Kawin Ethayarajh, Dan Jurafsky

TL;DR
This paper critically examines the standard human evaluation protocol in natural language generation (NLG), showing where it fails and proposing a new probabilistic assessment method that better captures true human preferences, especially for open-ended tasks like story generation.
Contribution
It identifies the limitations of Likert scale ratings in reflecting true preferences and introduces the system-level probabilistic assessment (SPA) as a more reliable evaluation protocol for open-ended NLG tasks.
Findings
Likert scales can reverse true preferences in evaluations.
SPA accurately recovers model rankings with statistical significance.
Standard protocols often fail to reflect actual human preferences.
Abstract
Human ratings are the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. Analyzing this standard protocol through the lens of utility theory in economics, we identify the implicit assumptions it makes about annotators. These assumptions are often violated in practice, in which case annotator ratings cease to reflect their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For…
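The standard protocol described in the abstract (collect ratings, average across annotators, rank systems by mean score) can be sketched in a few lines of Python. The sketch below is a toy illustration, not the paper's proof or data: the system names, the latent utility values, and the assumption that each annotator rounds a latent utility to the nearest point on a 1-5 Likert scale are all hypothetical. It shows one simple way that discretizing onto a Likert scale before averaging can flip the ranking implied by the underlying utilities.

```python
from statistics import mean


def rank_by_mean(ratings):
    """Standard protocol: average each system's ratings, rank best first."""
    return sorted(ratings, key=lambda system: mean(ratings[system]), reverse=True)


def to_likert(utility, lo=1, hi=5):
    """Assumed annotator behavior: round a latent utility to the nearest
    Likert point (half rounds up) and clip to the scale's range."""
    return min(hi, max(lo, int(utility + 0.5)))


# Hypothetical latent utilities from three annotators for two systems.
latent = {
    "A": [2.4, 2.4, 2.4],  # mean 2.4
    "B": [1.6, 1.6, 3.6],  # mean about 2.27
}

# What the annotators would report on a 1-5 Likert scale.
likert = {s: [to_likert(u) for u in us] for s, us in latent.items()}

latent_ranking = rank_by_mean(latent)  # ranking by underlying utility
likert_ranking = rank_by_mean(likert)  # ranking by reported Likert scores

print("latent ranking:", latent_ranking)
print("Likert ranking:", likert_ranking)
```

In this constructed case, system A has the higher mean latent utility, but after rounding its ratings become [2, 2, 2] while B's become [2, 2, 4], so the Likert averages favor B. The paper's own argument is framed in terms of utility theory; this example only makes the direction-reversal idea concrete under the rounding assumption stated above.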
Taxonomy
Topics: Decision-Making and Behavioral Economics · Sports Analytics and Performance · Experimental Behavioral Economics Studies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Attention Dropout · Linear Warmup With Cosine Annealing · Residual Connection
