The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

Marzena Karpinska; Nader Akoury; Mohit Iyyer

arXiv:2109.06835·cs.CL·September 15, 2021

TL;DR

This paper examines how reliable Amazon Mechanical Turk (AMT) is for evaluating open-ended text generation. It finds significant limitations in current practice and shows that crowdworker judgments are better calibrated when workers are shown references.

Contribution

The paper documents reproducibility problems in current evaluation practice, based on a survey of 45 open-ended text generation papers, and demonstrates that showing references improves crowdworker judgment accuracy.

Findings

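01. Even with strict qualification filters, AMT workers (unlike English teachers) fail to distinguish model-generated text from human-written references.

02. Showing references improves the calibration of crowdworker judgments.

03. Most surveyed papers omit crucial details about their AMT tasks, which hinders reproducibility.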

Abstract

Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker…
