The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation
Marzena Karpinska, Nader Akoury, Mohit Iyyer

TL;DR
This paper examines how reliable Amazon Mechanical Turk (AMT) is for evaluating open-ended text generation. It finds that AMT workers, unlike English teachers, fail to distinguish model-generated text from human-written references, and that showing workers the references helps them calibrate their judgments.
Contribution
The paper documents reproducibility problems in current evaluation practice, based on a survey of 45 open-ended text generation papers, and demonstrates that showing references improves the calibration of crowdworker judgments.
Findings
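Even with strict qualification filters, AMT workers (unlike English teachers) fail to distinguish model-generated text from human-written references
Showing workers the references improves the calibration of their judgments
Most of the 45 surveyed papers omit crucial details about their AMT tasks, which hinders reproducibility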
Abstract
Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker…
