Using Mechanical Turk to Build Machine Translation Evaluation Sets
Michael Bloodgood, Chris Callison-Burch

TL;DR
This paper explores using Amazon's Mechanical Turk to create cost-effective machine translation test sets, demonstrating that these sets are comparable in quality to professionally-made ones for evaluating system performance.
Contribution
It introduces a method for efficiently building MT test sets via MTurk and validates their effectiveness compared to traditional professional methods.
Findings
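MTurk test sets are much cheaper to produce than professionally produced test sets.
In experiments with multiple MT systems, MTurk test sets yield essentially the same conclusions about system performance as professional test sets.
MTurk offers a cost-effective way to expand MT evaluation resources to more language pairs and domains.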
Abstract
Building machine translation (MT) test sets is a relatively expensive task. As MT becomes increasingly desired for more and more language pairs and more and more domains, it becomes necessary to build test sets for each case. In this paper, we investigate using Amazon's Mechanical Turk (MTurk) to make MT test sets cheaply. We find that MTurk can be used to make test sets much cheaper than professionally-produced test sets. More importantly, in experiments with multiple MT systems, we find that the MTurk-produced test sets yield essentially the same conclusions regarding system performance as the professionally-produced test sets yield.
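One way to make "essentially the same conclusions regarding system performance" concrete is to check whether two test sets rank the same MT systems in the same order. The sketch below is an illustration only, not the procedure used in the paper: the choice of Kendall's tau, the function name `kendall_tau`, the system names, and all scores are assumptions, and the scores are hypothetical placeholders rather than reported results.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} dicts over the same systems.

    Returns 1.0 when both test sets order every pair of systems the same way,
    and -1.0 when they order every pair in opposite ways.
    """
    systems = sorted(scores_a)
    if systems != sorted(scores_b):
        raise ValueError("both test sets must score the same systems")
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Sign of the product tells us whether both test sets agree on
        # which of the two systems is better; ties count as neither.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical placeholder scores (NOT taken from the paper): one automatic
# metric score per system, measured on each of the two test sets.
professional = {"system_A": 0.31, "system_B": 0.27, "system_C": 0.22}
mturk = {"system_A": 0.29, "system_B": 0.26, "system_C": 0.20}

tau = kendall_tau(professional, mturk)
print(f"Kendall's tau between the two rankings: {tau:.2f}")
```

A tau close to 1 would indicate that the cheaper test set leads to the same ordering of systems as the professional one, even if the absolute scores differ.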
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Topic Modeling · Natural Language Processing Techniques
